Issue #1461
closed
Add the call to remove_unit_duplicate_nevra once optimised
Status:
CLOSED - CURRENTRELEASE
Description
In pulp_rpm's sync, the call to remove_unit_duplicate_nevra was temporarily removed in https://github.com/pulp/pulp_rpm/pull/756 because of serious performance problems (simply adding ~10,000 RPM units to Mongo took ~3 hours). The call ran after each RPM was added, and each call resulted in a very slow MongoDB query. It should be optimized in some way.
- Description updated (diff)
- Triaged changed from No to Yes
- Project changed from Pulp to RPM Support
- Triaged changed from Yes to No
- Triaged changed from No to Yes
Searching units in a repo is a somewhat expensive operation, and doing it once for each unit caused n**2 performance degradation. The likely solution to this will be:
- during sync, keep an in-memory collection that tracks the id and nevra of each rpm that gets added
- at the end, do one search for units that match those nevra but do not match the unit IDs, and remove the results
An alternative to searching based on unit IDs is to simply search for unit associations that were created before the sync operation started. This would likely perform better.
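A minimal sketch of the in-memory tracking idea described above, in plain Python with no database involved. The names `NEVRA`, `SyncedUnitTracker`, `record`, and `stale_unit_ids` are hypothetical and not part of pulp; in a real sync the final pass would be one query against the repo's units rather than a Python loop.

```python
from collections import namedtuple

# Hypothetical value type: the five fields that make up a package's NEVRA.
NEVRA = namedtuple("NEVRA", ["name", "epoch", "version", "release", "arch"])


class SyncedUnitTracker(object):
    """Remembers the unit id and NEVRA of every RPM added during one sync."""

    def __init__(self):
        # Maps each NEVRA seen during the sync to the unit ids added for it.
        self._ids_by_nevra = {}

    def record(self, unit_id, nevra):
        """Call once per unit added during the sync (cheap, no DB access)."""
        self._ids_by_nevra.setdefault(nevra, set()).add(unit_id)

    def stale_unit_ids(self, repo_units):
        """Return ids of repo units that share a synced NEVRA but were not
        added by this sync.

        repo_units is an iterable of (unit_id, NEVRA) pairs; here it stands
        in for the single end-of-sync search mentioned above.
        """
        stale = []
        for unit_id, nevra in repo_units:
            synced_ids = self._ids_by_nevra.get(nevra)
            if synced_ids is not None and unit_id not in synced_ids:
                stale.append(unit_id)
        return stale
```

The point of the design is that the per-unit work is a dictionary update, and the expensive search happens once per sync instead of once per unit.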
- Status changed from NEW to ASSIGNED
- Assignee set to semyers
After getting pulp to actually sync all of rawhide (48,000ish packages, on_demand download policy for the win), I tried a few different approaches to efficiently chewing on large unit collections (units_rpm, in this case). The winning solution uses a mongo aggregation pipeline to "pre-filter" the potential duplicate packages of a given content unit type (limited at the moment to RPM, SRPM, and DRPM, like the previous implementation). Those potential duplicates are then cross-referenced with a given repository to find (and purge) stale entries in the RepositoryContentUnit association collection.
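For readers who haven't used the aggregation framework, here is a rough sketch of what a "pre-filter" stage could look like. It is not the code from the branch: the function name `build_duplicate_nevra_pipeline` and the output field names `unit_ids` and `count` are made up for illustration, and the NEVRA field names are assumed to match the unit documents. It only builds the pipeline as plain data, so it runs without a MongoDB server.

```python
# Assumed field names for the NEVRA parts of a unit document.
NEVRA_FIELDS = ("name", "epoch", "version", "release", "arch")


def build_duplicate_nevra_pipeline(fields=NEVRA_FIELDS):
    """Build an aggregation pipeline that keeps only NEVRAs shared by
    more than one unit in a content unit collection."""
    # Group key: one sub-field per NEVRA part, e.g. {"name": "$name", ...}.
    group_key = dict((field, "$" + field) for field in fields)
    return [
        # Collapse units with identical NEVRA into one group, collecting
        # their ids and counting how many units are in the group.
        {"$group": {
            "_id": group_key,
            "unit_ids": {"$addToSet": "$_id"},
            "count": {"$sum": 1},
        }},
        # Drop groups with a single unit: those are not duplicates.
        {"$match": {"count": {"$gt": 1}}},
    ]
```

With pymongo the result would be passed to `collection.aggregate(...)`. Each group's `unit_ids` would then be the candidates to cross-reference against the repository's associations.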
The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR, because it's a little bit nuts:
https://github.com/pulp/pulp_rpm/compare/master...seandst:1461-purge-duplicat-nevra
semyers wrote:
The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR...
...and then all the serializer backward-incompatible stuff flared up on platform. I'm back on this issue now. :)
- Status changed from ASSIGNED to POST
POST!
https://github.com/pulp/pulp_rpm/pull/783
Here's one torture test that I used to load pulp up with duplicate nevra:
for x in $(seq 4)
do
    pulp-admin rpm repo create --feed "https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-rawhide&arch=x86_64" --download-policy on_demand --repo-id "rawhide-$x" --relative-url "rawhide-$x"
    pulp-admin rpm repo sync run --repo-id "rawhide-$x" &
done
Make sure you've got lazy downloading enabled.
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
- Status changed from MODIFIED to 5
- Has duplicate Issue #1744: Duplicate RPMs in pulp can't be installed added
- Status changed from 5 to CLOSED - CURRENTRELEASE
Re-institute purging duplicate NEVRA after an rpm repo sync
https://pulp.plan.io/issues/1461 fixes #1461
The aggregation technique is better in all respects than mapreduce, save one: mapreduce works in mongo 2.4, and thus el6. If, by chance, we get to use mongo 2.6+ exclusively, the mapreduce method should be removed. For now, it is guarded by a mongo version check.
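A small sketch of what such a version guard might look like, assuming the server version is available as a string (for example from pymongo's `server_info()["version"]`). The function names and the returned labels are hypothetical; the actual check in the commit may be structured differently.

```python
def parse_major_minor(version_string):
    """Turn a version string such as '2.4.14' into a (major, minor) tuple."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def choose_duplicate_finder(mongo_version):
    """Pick the duplicate-finding method for the given server version.

    Aggregation is preferred on 2.6+; mapreduce is the fallback for
    older servers such as the 2.4 found on el6.
    """
    if parse_major_minor(mongo_version) >= (2, 6):
        return "aggregation"
    return "mapreduce"
```

Keeping the decision in one function means the mapreduce branch can be deleted in a single place if mongo 2.4 support is ever dropped.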