re-publish takes longer than expected
Users have see that a re-publish of the following Red Hat repository can take upwards of 35 minutes. This is very surprising, since it only has 213 rpms, and 163 errata.
The problem has also been reported with this repo:
On my laptop running F23, trying with pulp 2.8 and 2.9 (obviously alpha), I'm seeing publish times of about 15s for the "extras" repo. But others have been able to reproduce on Fedora.
Erratum publish reads repo nevra into memory
The Errata publish performance became a problem recently due to lots of db calls being made as Errata metadata is filtered to only refer to packages in this specific repo.
This PR causes all the repo's NEVRA to be stored in memory once and all filtering to work out of that data structure. This avoids searching the db everytime for the NEVRA specified by the erratum so it does much less work than before.
#2 Updated by firstname.lastname@example.org over 5 years ago
- File Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_sync_stats Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_sync_stats added
- File Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_publish_stats Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_-_Extras_RPMs_x86_64_publish_stats added
- File el7-extras_publish_stats el7-extras_publish_stats added
- File scl7_publish_stats scl7_publish_stats added
From my brief analysis it's because of the work for https://pulp.plan.io/issues/1548
#8 Updated by email@example.com over 5 years ago
I have a tried to to re sync a rhel6 repo which has --skp rpm and takes over an hour to publish errata. it is taking roughly 10s per errata And cpu usage hovers around 100% PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3799 apache 20 0 905m 95m 4136 R 98.5 2.0 11:54.38 python 4080 root 20 0 271m 22m 6460 S 1.9 0.5 0:11.51 pulp-admin 23551 mongodb 20 0 3229m 548m 237m S 1.6 11.3 6:55.11 mongod 3186 apache 20 0 1100m 59m 9112 S 1.2 1.2 0:08.21 httpd Also worth noting that on el7 the same sync completed just fine [root@mgmt8 ~]# time pulp-admin rpm repo sync run --repo-id rhel6 +----------------------------------------------------------------------+ Synchronizing Repository [rhel6] +----------------------------------------------------------------------+ This command may be exited via ctrl+c without affecting the request. Downloading metadata... [\] ... completed Downloading repository content... [-] [==================================================] 100% RPMs: 0/0 items Delta RPMs: 0/0 items ... completed Downloading distribution files... [==================================================] 100% Distributions: 0/0 items ... completed Importing errata... [-] ... completed Importing package groups/categories... [-] ... completed Cleaning duplicate packages... [-] ... completed Task Succeeded Copying files [-] ... completed Initializing repo metadata [-] ... completed Publishing Distribution files [-] ... completed Publishing RPMs [-] ... completed Publishing Delta RPMs ... skipped Publishing Errata [==================================================] 100% 3325 of 3325 items ... completed Publishing Comps file [==================================================] 100% 212 of 212 items ... completed Publishing Metadata. [-] ... completed Closing repo metadata [-] ... completed Generating sqlite files ... skipped Publishing files to web [-] ... completed Writing Listings File [-] ... completed Task Succeeded real 2m56.243s user 0m7.112s sys 0m0.276s [root@mgmt8 ~]#
#10 Updated by semyers over 5 years ago
I think it might be worth going over the history of how we got here. The first change was when we started concatenating errata package lists, which was necessary due to an incorrect assumption that errata with the same id would not appear in multiple repos. You'd sync down el7 and its errata, and presumably those errata link to packages in the synced repo. Then you'd sync el6 and its errata, and the el6 package list would replace the el7 package list. Now those errata only link to el7 packages, but when the el7 repo is republished by pulp, it's linked to errata that now reference el6 packages, and so it publishes el6 errata into an el7 repo. To fix this, we started concatenating package lists from multiple repos.
Here's the change, and the related bugzilla:
For a while, everything seemed to be working, but then reports started coming in about errata referencing packages that aren't in a published repo (because in our example above, the updateinfo package list is now naming el6 packages in a published el7 repo):
My solution to this, seen in 1548, is what most likely causes the slowness, which is to go through the concatenated errata package lists and filter out packages not in the repo being published.
Another errata-related issue, having to do with the syncing of errata metadata even a repo has already been sync, isn't really related to this at first glance, but it might hold some valuable information to assist in solving this issue. Starting at comment 7, a discussion between ttereshc and me, with input from jluza, reveals some very useful details about what can and cannot be relied upon in the errata data:
ttereshc's fix here was to modify package list short names to include the pulp repo_id, ensuring that even if the errata unit is shared among multiple repos, we at least have a way to make those packagelist names unique per-repo. This might even give us a way to link packagelists back to a repo by repo_id, but I'm not sure about the reliability of this approach, since it might be based on parsing a string to pull the repo_id out.
I have an idea about doing this reliably, which is basically to iterate over repo_ids and package list names, returning the package list who's short name .endswith() the longest match repo_id match, and pulling out the package list for the repo being published. I can elaborate on this if needed, but if the logic works, there's still the problem that it would only work on repos synced since ttereshc's change made it into pulp, so we likely still need my slow solution from #1548 to stick around if a package list can't be associated with the repo being published.
#14 Updated by mhrivnak over 5 years ago
Assuming the performance problem is in querying the database, here are some quick thoughts that may or may not be helpful, depending on what is revealed by further investigation:
The problem boils down to: for each RPM listed in an errata package list, pulp needs to know if it is in the repository, and if not, leaves it out of the published xml.
One option is to do one big query of the DB to get all of the NEVRA that are in the repo, and store them in a python set. That could be very quickly queried while iterating through the errata. Sync already does something similar to do dependency resolution, so the memory impact should be tolerable.
Something less heavy-handed would be to do a first pass through the errata to collect all the NEVRA referenced by them, then do one big query for just those nevra, and then do a second pass iterating the errata with the knowledge of which RPMs are in the repo. This would have a lower memory footprint.
#16 Updated by bmbouter over 5 years ago
I tested the new, quicker implementation to ensure it was omitting NEVRA from other repos using:
# create two repos pulp-admin rpm repo create --repo-id rhel6 --feed https://YOURCDNHOSTNAME/content/dist/rhel/server/6/6Server/x86_64/os/ --download-policy on_demand pulp-admin rpm repo create --repo-id rhel7 --feed https://YOURCDNHOSTNAME/content/dist/rhel/server/7/7Server/x86_64/os/ --download-policy on_demand # sync two repos pulp-admin rpm sync publish run --repo-id rhel6 pulp-admin rpm sync publish run --repo-id rhel7 # <----- this is where I checked the filtering to ensure that rhel6 rpms were filtered out. # For example RHSA-2014:1293 and RHSA-2016:1217 both refer to RPMs in EL6 and EL7.
#19 Updated by semyers over 5 years ago
For the following repo with 213 rpms, and 163 errata, the Errata publish step was taking about 32 seconds. With the new implementation is takes < 2.
If you're able to do before-and-after profiling and have a way to get memory numbers, I think those numbers might look good, too.
#20 Updated by bmbouter over 5 years ago
The new design stores all of a repo's rpm nevra as a list of named tuples in memory. For the 213 RPMs this is stored as 1936 bytes as reported by sys.getsizeof(). This variable is populated when the PublishErrataStep is initialized and garbage collected after the publish is complete.
>>> sys.getsizeof(repo_nevra) Out: 1936 >>> len(repo_nevra) Out: 213
Please register to edit this issue