Add the call to remove_unit_duplicate_nevra once optimised
In pulp_rpm's sync, the remove_unit_duplicate_nevra was temporarily removed in https://github.com/pulp/pulp_rpm/pull/756 because of serious performance problems (simply adding ~10,000 RPM units to Mongo took ~3 hours). The call was occurring after each RPM was added and resulted in a very slow MongoDB query. It should be optimized in some way.
Re-institute purging duplicate NEVRA after an rpm repo sync
The aggregation technique is better in all respects than mapreduce, save one: mapreduce works in mongo 2.4, and thus el6. If, by chance, we get to use mongo 2.6+ exclusively, the mapreduce method should be removed. For now, it is guarded by a mongo version check.
#5 Updated by mhrivnak over 5 years ago
Searching units in a repo is a somewhat expensive operation, and doing it once for each unit caused n**2 performance degredation. The likely solution to this will be:
- during sync, keep an in-memory collection that tracks the id and nevra of each rpm that gets added
- at the end, do one search for units that match those nevra but do not match the unit IDs, and remove the results
An alternative to searching based on unit IDs is to simply search for unit associations that were created before the sync operation started. This would likely perform better.
#7 Updated by semyers over 5 years ago
After getting pulp to actually sync all of rawhide (48,000ish packages, on_demand download policy for the win), I tried a few different approaches at efficiently chewing on large unit collections (units_rpm, in this case). The winning solution is using a mongo aggregation pipeline to "pre-filter" a given content unit type's (limited at the moment to RPM, SRPM, and DRPM, like the previous implementation) potential duplicate packages, and then cross-reference those potential duplicates with a given repository to find (and purge) stale entries in the RepositoryContentUnit association collection.
The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR, because it's a little bit nuts:
#9 Updated by semyers over 5 years ago
- Status changed from ASSIGNED to POST
Here's one torture test that I used to load pulp up with duplicate nevra:
for x in `seq 4` do pulp-admin rpm repo create --feed "https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-rawhide&arch=x86_64" --download-policy on_demand --repo-id rawhide-$x --relative-url rawhide-$x pulp-admin rpm repo sync run --repo-id rawhide-$x & done
Make sure you've got lazy downloading enabled.
Please register to edit this issue