Issue #1949
closedre-publish takes longer than expected
Description
Users have see that a re-publish of the following Red Hat repository can take upwards of 35 minutes. This is very surprising, since it only has 213 rpms, and 163 errata.
/content/dist/rhel/server/7/7Server/x86_64/extras/os/
The problem has also been reported with this repo:
/content/dist/rhel/server/7/7Server/x86_64/rhscl/1/os/
On my laptop running F23, trying with pulp 2.8 and 2.9 (obviously alpha), I'm seeing publish times of about 15s for the "extras" repo. But others have been able to reproduce on Fedora.
Files
Related issues
Updated by mhrivnak over 8 years ago
Once triaged, this should go on the current sprint so we can at a minimum identify exactly where the time is being spent, and identify options for improvement.
Updated by jcline@redhat.com over 8 years ago
- File Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_sync_stats Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_sync_stats added
- File Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_publish_stats Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_publish_stats added
- File el7-extras_publish_stats el7-extras_publish_stats added
- File scl7_publish_stats scl7_publish_stats added
From my brief analysis it's because of the work for https://pulp.plan.io/issues/1548
Updated by jcline@redhat.com over 8 years ago
This is probably a duplicate of https://pulp.plan.io/issues/1947
Updated by mhrivnak over 8 years ago
- Has duplicate Issue #1947: Concurrent sync of repos is running slow added
Updated by bmbouter over 8 years ago
- Sprint/Milestone set to 21
Per Comment 1 I'm putting this on the current Sprint. If possible, the fix should be introduced in the 2.8.z release stream.
Updated by pthomas@redhat.com over 8 years ago
I have a tried to to re sync a rhel6 repo which has --skp rpm and takes over an hour to publish errata. it is taking roughly 10s per errata
And cpu usage hovers around 100%
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3799 apache 20 0 905m 95m 4136 R 98.5 2.0 11:54.38 python
4080 root 20 0 271m 22m 6460 S 1.9 0.5 0:11.51 pulp-admin
23551 mongodb 20 0 3229m 548m 237m S 1.6 11.3 6:55.11 mongod
3186 apache 20 0 1100m 59m 9112 S 1.2 1.2 0:08.21 httpd
Also worth noting that on el7 the same sync completed just fine
[root@mgmt8 ~]# time pulp-admin rpm repo sync run --repo-id rhel6
+----------------------------------------------------------------------+
Synchronizing Repository [rhel6]
+----------------------------------------------------------------------+
This command may be exited via ctrl+c without affecting the request.
Downloading metadata...
[\]
... completed
Downloading repository content...
[-]
[==================================================] 100%
RPMs: 0/0 items
Delta RPMs: 0/0 items
... completed
Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed
Importing errata...
[-]
... completed
Importing package groups/categories...
[-]
... completed
Cleaning duplicate packages...
[-]
... completed
Task Succeeded
Copying files
[-]
... completed
Initializing repo metadata
[-]
... completed
Publishing Distribution files
[-]
... completed
Publishing RPMs
[-]
... completed
Publishing Delta RPMs
... skipped
Publishing Errata
[==================================================] 100%
3325 of 3325 items
... completed
Publishing Comps file
[==================================================] 100%
212 of 212 items
... completed
Publishing Metadata.
[-]
... completed
Closing repo metadata
[-]
... completed
Generating sqlite files
... skipped
Publishing files to web
[-]
... completed
Writing Listings File
[-]
... completed
Task Succeeded
real 2m56.243s
user 0m7.112s
sys 0m0.276s
[root@mgmt8 ~]#
Updated by semyers over 8 years ago
I think it might be worth going over the history of how we got here. The first change was when we started concatenating errata package lists, which was necessary due to an incorrect assumption that errata with the same id would not appear in multiple repos. You'd sync down el7 and its errata, and presumably those errata link to packages in the synced repo. Then you'd sync el6 and its errata, and the el6 package list would replace the el7 package list. Now those errata only link to el7 packages, but when the el7 repo is republished by pulp, it's linked to errata that now reference el6 packages, and so it publishes el6 errata into an el7 repo. To fix this, we started concatenating package lists from multiple repos.
Here's the change, and the related bugzilla:
https://github.com/pulp/pulp_rpm/pull/625
https://bugzilla.redhat.com/show_bug.cgi?id=1171278
For a while, everything seemed to be working, but then reports started coming in about errata referencing packages that aren't in a published repo (because in our example above, the updateinfo package list is now naming el6 packages in a published el7 repo):
https://pulp.plan.io/issues/1366
https://pulp.plan.io/issues/1548
My solution to this, seen in 1548, is what most likely causes the slowness, which is to go through the concatenated errata package lists and filter out packages not in the repo being published.
Another errata-related issue, having to do with the syncing of errata metadata even a repo has already been sync, isn't really related to this at first glance, but it might hold some valuable information to assist in solving this issue. Starting at comment 7, a discussion between ttereshc and me, with input from jluza, reveals some very useful details about what can and cannot be relied upon in the errata data:
https://pulp.plan.io/issues/858#note-7
ttereshc's fix here was to modify package list short names to include the pulp repo_id, ensuring that even if the errata unit is shared among multiple repos, we at least have a way to make those packagelist names unique per-repo. This might even give us a way to link packagelists back to a repo by repo_id, but I'm not sure about the reliability of this approach, since it might be based on parsing a string to pull the repo_id out.
I have an idea about doing this reliably, which is basically to iterate over repo_ids and package list names, returning the package list who's short name .endswith() the longest match repo_id match, and pulling out the package list for the repo being published. I can elaborate on this if needed, but if the logic works, there's still the problem that it would only work on repos synced since ttereshc's change made it into pulp, so we likely still need my slow solution from #1548 to stick around if a package list can't be associated with the repo being published.
Updated by semyers over 8 years ago
Related to my previous comment: #858 has a target release of 2.8.5, which is being released imminently.
Updated by semyers over 8 years ago
I also forgot to mention that #1989 is one way that I think we can solve this.
Updated by bmbouter over 8 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to bmbouter
Updated by mhrivnak over 8 years ago
Assuming the performance problem is in querying the database, here are some quick thoughts that may or may not be helpful, depending on what is revealed by further investigation:
The problem boils down to: for each RPM listed in an errata package list, pulp needs to know if it is in the repository, and if not, leaves it out of the published xml.
One option is to do one big query of the DB to get all of the NEVRA that are in the repo, and store them in a python set. That could be very quickly queried while iterating through the errata. Sync already does something similar to do dependency resolution, so the memory impact should be tolerable.
Something less heavy-handed would be to do a first pass through the errata to collect all the NEVRA referenced by them, then do one big query for just those nevra, and then do a second pass iterating the errata with the knowledge of which RPMs are in the repo. This would have a lower memory footprint.
Updated by bmbouter over 8 years ago
I tested the new, quicker implementation to ensure it was omitting NEVRA from other repos using:
# create two repos
pulp-admin rpm repo create --repo-id rhel6 --feed https://YOURCDNHOSTNAME/content/dist/rhel/server/6/6Server/x86_64/os/ --download-policy on_demand
pulp-admin rpm repo create --repo-id rhel7 --feed https://YOURCDNHOSTNAME/content/dist/rhel/server/7/7Server/x86_64/os/ --download-policy on_demand
# sync two repos
pulp-admin rpm sync publish run --repo-id rhel6
pulp-admin rpm sync publish run --repo-id rhel7 # <----- this is where I checked the filtering to ensure that rhel6 rpms were filtered out.
# For example RHSA-2014:1293 and RHSA-2016:1217 both refer to RPMs in EL6 and EL7.
Updated by bmbouter over 8 years ago
For the following repo with 213 rpms, and 163 errata, the Errata publish step was taking about 32 seconds. With the new implementation is takes < 2.
/content/dist/rhel/server/7/7Server/x86_64/extras/os/
Updated by semyers over 8 years ago
bmbouter wrote:
For the following repo with 213 rpms, and 163 errata, the Errata publish step was taking about 32 seconds. With the new implementation is takes < 2.
/content/dist/rhel/server/7/7Server/x86_64/extras/os/
If you're able to do before-and-after profiling and have a way to get memory numbers, I think those numbers might look good, too.
Updated by bmbouter over 8 years ago
The new design stores all of a repo's rpm nevra as a list of named tuples in memory. For the 213 RPMs this is stored as 1936 bytes as reported by sys.getsizeof(). This variable is populated when the PublishErrataStep is initialized and garbage collected after the publish is complete.
>>> sys.getsizeof(repo_nevra)
Out[7]: 1936
>>> len(repo_nevra)
Out[8]: 213
Updated by bmbouter over 8 years ago
- Status changed from ASSIGNED to POST
PR available at https://github.com/pulp/pulp_rpm/pull/916
Added by bmbouter over 8 years ago
Updated by bmbouter over 8 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset 500341f6d8694dacfbc06e62549232923abd77cc.
Updated by pthomas@redhat.com over 8 years ago
- Status changed from MODIFIED to 6
Verified
1. Updated from 2.8.5 to the nightly to make sure that republish is not taking long
2. Verified #16
3. Reverified https://pulp.plan.io/issues/1548
Updated by semyers over 8 years ago
- Status changed from 5 to 6
I accidentally moved this from VERIFIED back to ON_QA. Sorry for the ticket noise!
Updated by semyers over 8 years ago
- Status changed from 6 to CLOSED - CURRENTRELEASE
Updated by semyers over 8 years ago
- Related to Task #2083: Issues common to 2.9.1 and 2.8 stream added
Erratum publish reads repo nevra into memory
The Errata publish performance became a problem recently due to lots of db calls being made as Errata metadata is filtered to only refer to packages in this specific repo.
This PR causes all the repo's NEVRA to be stored in memory once and all filtering to work out of that data structure. This avoids searching the db everytime for the NEVRA specified by the erratum so it does much less work than before.
https://pulp.plan.io/issues/1949 fixes #1949