Project

Profile

Help

Issue #1949

closed

re-publish takes longer than expected

Added by mhrivnak over 8 years ago. Updated over 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
3. High
Version:
Platform Release:
2.8.6
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 4
Quarter:

Description

Users have see that a re-publish of the following Red Hat repository can take upwards of 35 minutes. This is very surprising, since it only has 213 rpms, and 163 errata.

/content/dist/rhel/server/7/7Server/x86_64/extras/os/

The problem has also been reported with this repo:

/content/dist/rhel/server/7/7Server/x86_64/rhscl/1/os/

On my laptop running F23, trying with pulp 2.8 and 2.9 (obviously alpha), I'm seeing publish times of about 15s for the "extras" repo. But others have been able to reproduce on Fedora.


Files


Related issues

Related to Packaging - Task #2083: Issues common to 2.9.1 and 2.8 streamCLOSED - NOTABUGsemyers

Actions
Has duplicate Pulp - Issue #1947: Concurrent sync of repos is running slowCLOSED - DUPLICATEActions
Actions #1

Updated by mhrivnak over 8 years ago

Once triaged, this should go on the current sprint so we can at a minimum identify exactly where the time is being spent, and identify options for improvement.

Actions #3

Updated by jcline@redhat.com over 8 years ago

This is probably a duplicate of https://pulp.plan.io/issues/1947

Actions #4

Updated by bmbouter over 8 years ago

  • Triaged changed from No to Yes
Actions #5

Updated by mhrivnak over 8 years ago

  • Has duplicate Issue #1947: Concurrent sync of repos is running slow added
Actions #6

Updated by bmbouter over 8 years ago

  • Sprint/Milestone set to 21

Per Comment 1 I'm putting this on the current Sprint. If possible, the fix should be introduced in the 2.8.z release stream.

Actions #8

Updated by pthomas@redhat.com over 8 years ago

I have a tried to to re sync a rhel6 repo which has --skp rpm and takes over an hour to publish errata. it is taking roughly 10s per errata

And cpu usage hovers around 100%

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 3799 apache    20   0  905m  95m 4136 R 98.5  2.0  11:54.38 python             
 4080 root      20   0  271m  22m 6460 S  1.9  0.5   0:11.51 pulp-admin         
23551 mongodb   20   0 3229m 548m 237m S  1.6 11.3   6:55.11 mongod             
 3186 apache    20   0 1100m  59m 9112 S  1.2  1.2   0:08.21 httpd   

Also worth noting that on el7 the same sync completed just fine
[root@mgmt8 ~]# time pulp-admin rpm repo sync run --repo-id rhel6
+----------------------------------------------------------------------+
                    Synchronizing Repository [rhel6]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.

Downloading metadata...
[\]
... completed

Downloading repository content...
[-]
[==================================================] 100%
RPMs:       0/0 items
Delta RPMs: 0/0 items

... completed

Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed

Importing errata...
[-]
... completed

Importing package groups/categories...
[-]
... completed

Cleaning duplicate packages...
[-]
... completed

Task Succeeded

Copying files
[-]
... completed

Initializing repo metadata
[-]
... completed

Publishing Distribution files
[-]
... completed

Publishing RPMs
[-]
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[==================================================] 100%
3325 of 3325 items
... completed

Publishing Comps file
[==================================================] 100%
212 of 212 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Publishing files to web
[-]
... completed

Writing Listings File
[-]
... completed

Task Succeeded

real    2m56.243s
user    0m7.112s
sys    0m0.276s
[root@mgmt8 ~]# 


Actions #9

Updated by mhrivnak over 8 years ago

  • Sprint/Milestone changed from 21 to 22
Actions #10

Updated by semyers over 8 years ago

I think it might be worth going over the history of how we got here. The first change was when we started concatenating errata package lists, which was necessary due to an incorrect assumption that errata with the same id would not appear in multiple repos. You'd sync down el7 and its errata, and presumably those errata link to packages in the synced repo. Then you'd sync el6 and its errata, and the el6 package list would replace the el7 package list. Now those errata only link to el7 packages, but when the el7 repo is republished by pulp, it's linked to errata that now reference el6 packages, and so it publishes el6 errata into an el7 repo. To fix this, we started concatenating package lists from multiple repos.

Here's the change, and the related bugzilla:
https://github.com/pulp/pulp_rpm/pull/625
https://bugzilla.redhat.com/show_bug.cgi?id=1171278

For a while, everything seemed to be working, but then reports started coming in about errata referencing packages that aren't in a published repo (because in our example above, the updateinfo package list is now naming el6 packages in a published el7 repo):

https://pulp.plan.io/issues/1366
https://pulp.plan.io/issues/1548

My solution to this, seen in 1548, is what most likely causes the slowness, which is to go through the concatenated errata package lists and filter out packages not in the repo being published.

Another errata-related issue, having to do with the syncing of errata metadata even a repo has already been sync, isn't really related to this at first glance, but it might hold some valuable information to assist in solving this issue. Starting at comment 7, a discussion between ttereshc and me, with input from jluza, reveals some very useful details about what can and cannot be relied upon in the errata data:

https://pulp.plan.io/issues/858#note-7

ttereshc's fix here was to modify package list short names to include the pulp repo_id, ensuring that even if the errata unit is shared among multiple repos, we at least have a way to make those packagelist names unique per-repo. This might even give us a way to link packagelists back to a repo by repo_id, but I'm not sure about the reliability of this approach, since it might be based on parsing a string to pull the repo_id out.

I have an idea about doing this reliably, which is basically to iterate over repo_ids and package list names, returning the package list who's short name .endswith() the longest match repo_id match, and pulling out the package list for the repo being published. I can elaborate on this if needed, but if the logic works, there's still the problem that it would only work on repos synced since ttereshc's change made it into pulp, so we likely still need my slow solution from #1548 to stick around if a package list can't be associated with the repo being published.

Actions #11

Updated by semyers over 8 years ago

Related to my previous comment: #858 has a target release of 2.8.5, which is being released imminently.

Actions #12

Updated by semyers over 8 years ago

I also forgot to mention that #1989 is one way that I think we can solve this.

Actions #13

Updated by bmbouter over 8 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
Actions #14

Updated by mhrivnak over 8 years ago

Assuming the performance problem is in querying the database, here are some quick thoughts that may or may not be helpful, depending on what is revealed by further investigation:

The problem boils down to: for each RPM listed in an errata package list, pulp needs to know if it is in the repository, and if not, leaves it out of the published xml.

One option is to do one big query of the DB to get all of the NEVRA that are in the repo, and store them in a python set. That could be very quickly queried while iterating through the errata. Sync already does something similar to do dependency resolution, so the memory impact should be tolerable.

Something less heavy-handed would be to do a first pass through the errata to collect all the NEVRA referenced by them, then do one big query for just those nevra, and then do a second pass iterating the errata with the knowledge of which RPMs are in the repo. This would have a lower memory footprint.

Actions #16

Updated by bmbouter over 8 years ago

I tested the new, quicker implementation to ensure it was omitting NEVRA from other repos using:

# create two repos
pulp-admin rpm repo create --repo-id rhel6 --feed https://YOURCDNHOSTNAME/content/dist/rhel/server/6/6Server/x86_64/os/ --download-policy on_demand
pulp-admin rpm repo create --repo-id rhel7 --feed https://YOURCDNHOSTNAME/content/dist/rhel/server/7/7Server/x86_64/os/ --download-policy on_demand

# sync two repos
pulp-admin rpm sync publish run --repo-id rhel6
pulp-admin rpm sync publish run --repo-id rhel7     #  <----- this is where I checked the filtering to ensure that rhel6 rpms were filtered out.

# For example RHSA-2014:1293 and RHSA-2016:1217 both refer to RPMs in EL6 and EL7.
Actions #18

Updated by bmbouter over 8 years ago

For the following repo with 213 rpms, and 163 errata, the Errata publish step was taking about 32 seconds. With the new implementation is takes < 2.

/content/dist/rhel/server/7/7Server/x86_64/extras/os/

Actions #19

Updated by semyers over 8 years ago

bmbouter wrote:

For the following repo with 213 rpms, and 163 errata, the Errata publish step was taking about 32 seconds. With the new implementation is takes < 2.

/content/dist/rhel/server/7/7Server/x86_64/extras/os/

If you're able to do before-and-after profiling and have a way to get memory numbers, I think those numbers might look good, too.

Actions #20

Updated by bmbouter over 8 years ago

The new design stores all of a repo's rpm nevra as a list of named tuples in memory. For the 213 RPMs this is stored as 1936 bytes as reported by sys.getsizeof(). This variable is populated when the PublishErrataStep is initialized and garbage collected after the publish is complete.

>>> sys.getsizeof(repo_nevra)
Out[7]: 1936
>>> len(repo_nevra)
Out[8]: 213
Actions #21

Updated by bmbouter over 8 years ago

  • Status changed from ASSIGNED to POST

Added by bmbouter over 8 years ago

Revision 500341f6 | View on GitHub

Erratum publish reads repo nevra into memory

The Errata publish performance became a problem recently due to lots of db calls being made as Errata metadata is filtered to only refer to packages in this specific repo.

This PR causes all the repo's NEVRA to be stored in memory once and all filtering to work out of that data structure. This avoids searching the db everytime for the NEVRA specified by the erratum so it does much less work than before.

https://pulp.plan.io/issues/1949 fixes #1949

Actions #22

Updated by bmbouter over 8 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #23

Updated by bmbouter over 8 years ago

  • Platform Release set to 2.8.6
Actions #24

Updated by pthomas@redhat.com over 8 years ago

  • Status changed from MODIFIED to 6

Verified

1. Updated from 2.8.5 to the nightly to make sure that republish is not taking long

2. Verified #16

3. Reverified https://pulp.plan.io/issues/1548

Actions #25

Updated by semyers over 8 years ago

  • Status changed from 6 to 5
Actions #26

Updated by semyers over 8 years ago

  • Status changed from 5 to 6

I accidentally moved this from VERIFIED back to ON_QA. Sorry for the ticket noise!

Actions #27

Updated by semyers over 8 years ago

  • Status changed from 6 to CLOSED - CURRENTRELEASE
Actions #28

Updated by semyers over 8 years ago

  • Related to Task #2083: Issues common to 2.9.1 and 2.8 stream added
Actions #30

Updated by bmbouter over 6 years ago

  • Sprint set to Sprint 4
Actions #31

Updated by bmbouter over 6 years ago

  • Sprint/Milestone deleted (22)
Actions #32

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added

Also available in: Atom PDF