Issue #1461

closed

Add the call to remove_unit_duplicate_nevra once optimised

Added by jcline@redhat.com about 9 years ago. Updated over 5 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee:
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release: 2.8.0
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

In pulp_rpm's sync, the call to remove_unit_duplicate_nevra was temporarily removed in https://github.com/pulp/pulp_rpm/pull/756 because of serious performance problems (simply adding ~10,000 RPM units to Mongo took ~3 hours). The call ran after each RPM was added, and each call triggered a very slow MongoDB query. It should be optimized so that it can be restored.


Related issues

Has duplicate Pulp - Issue #1744: Duplicate RPMs in pulp can't be installed (CLOSED - DUPLICATE)
Actions #1

Updated by jcline@redhat.com about 9 years ago

  • Description updated (diff)
Actions #2

Updated by mhrivnak about 9 years ago

  • Triaged changed from No to Yes
Actions #3

Updated by jcline@redhat.com about 9 years ago

  • Project changed from Pulp to RPM Support
  • Triaged changed from Yes to No
Actions #4

Updated by jcline@redhat.com about 9 years ago

  • Triaged changed from No to Yes
Actions #5

Updated by mhrivnak about 9 years ago

Searching units in a repo is a somewhat expensive operation, and doing it once for each unit caused O(n**2) performance degradation. The likely solution to this will be:

- during sync, keep an in-memory collection that tracks the id and nevra of each rpm that gets added
- at the end, do one search for units that match those nevra but do not match the unit IDs, and remove the results

An alternative to searching based on unit IDs is to simply search for unit associations that were created before the sync operation started. This would likely perform better.
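The first proposal above can be sketched roughly as follows. This is only an illustration in plain Python, not pulp_rpm code: the `SyncTracker` class, its method names, and the `(unit_id, nevra)` tuples standing in for a repo's units are all hypothetical, and a real implementation would do the final lookup as one MongoDB query rather than a Python loop.

```python
from collections import namedtuple

# NEVRA: name, epoch, version, release, arch.
Nevra = namedtuple("Nevra", ["name", "epoch", "version", "release", "arch"])


class SyncTracker(object):
    """Hypothetical helper: remembers the unit id synced for each NEVRA."""

    def __init__(self):
        # Maps Nevra -> id of the unit added during this sync.
        self.synced = {}

    def record(self, unit_id, nevra):
        """Call once per RPM as it is added during the sync (cheap, in memory)."""
        self.synced[nevra] = unit_id

    def stale_unit_ids(self, repo_units):
        """Run once at the end of the sync.

        repo_units is an iterable of (unit_id, nevra) pairs for the repo.
        Returns ids of units whose NEVRA was synced under a different id.
        """
        return [
            unit_id
            for unit_id, nevra in repo_units
            if nevra in self.synced and self.synced[nevra] != unit_id
        ]
```

With this shape the per-unit cost during sync is a dictionary write, and the expensive search happens once instead of once per unit.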

Actions #6

Updated by semyers almost 9 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to semyers
Actions #7

Updated by semyers almost 9 years ago

After getting pulp to actually sync all of rawhide (48,000ish packages, on_demand download policy for the win), I tried a few different approaches to efficiently chewing through large unit collections (units_rpm, in this case). The winning solution uses a mongo aggregation pipeline to "pre-filter" potential duplicate packages for a given content unit type (limited at the moment to RPM, SRPM, and DRPM, like the previous implementation). Those potential duplicates are then cross-referenced with a given repository to find (and purge) stale entries in the RepositoryContentUnit association collection.

The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR, because it's a little bit nuts:
https://github.com/pulp/pulp_rpm/compare/master...seandst:1461-purge-duplicat-nevra
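For readers who have not seen the approach, here is a rough sketch of what a "pre-filter, then cross-reference" flow could look like. It is not taken from the branch: the function names are hypothetical, the field names assume unit documents store name, epoch, version, release and arch at the top level, and "keep the most recently associated unit" is an assumption about what counts as stale. The pipeline itself uses only the standard `$group`, `$push`, `$sum` and `$match` operators and would be passed to pymongo's `collection.aggregate()`.

```python
NEVRA_FIELDS = ("name", "epoch", "version", "release", "arch")


def duplicate_nevra_pipeline():
    """Build an aggregation pipeline that keeps only NEVRAs shared by 2+ units."""
    group_key = dict((field, "$" + field) for field in NEVRA_FIELDS)
    return [
        # Group units by NEVRA, collecting their ids and counting them.
        {"$group": {"_id": group_key,
                    "unit_ids": {"$push": "$_id"},
                    "count": {"$sum": 1}}},
        # Discard NEVRAs that only one unit has.
        {"$match": {"count": {"$gt": 1}}},
    ]


def stale_associations(duplicate_groups, repo_associations):
    """Cross-reference potential duplicates with one repository.

    duplicate_groups: documents shaped like the pipeline output above.
    repo_associations: dict of unit_id -> association timestamp for the repo.
    Returns unit ids to purge, keeping the newest association per NEVRA
    (an assumption made for this sketch).
    """
    stale = []
    for group in duplicate_groups:
        in_repo = [uid for uid in group["unit_ids"] if uid in repo_associations]
        if len(in_repo) < 2:
            continue  # Duplicates exist globally, but not within this repo.
        newest = max(in_repo, key=lambda uid: repo_associations[uid])
        stale.extend(uid for uid in in_repo if uid != newest)
    return stale
```

The point of the split is that the aggregation runs once per unit type over the whole collection, while the per-repository step only touches the small set of ids that survived the filter.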

Actions #8

Updated by semyers almost 9 years ago

semyers wrote:

The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR...

...and then all the serializer backward-incompatible stuff flared up on platform. I'm back on this issue now. :)

Actions #9

Updated by semyers almost 9 years ago

  • Status changed from ASSIGNED to POST

POST!
https://github.com/pulp/pulp_rpm/pull/783

Here's one torture test that I used to load pulp up with duplicate nevra:

# Create four rawhide repos and start a background sync for each.
for x in $(seq 4)
do
  pulp-admin rpm repo create --feed "https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-rawhide&arch=x86_64" --download-policy on_demand --repo-id "rawhide-$x" --relative-url "rawhide-$x"
  pulp-admin rpm repo sync run --repo-id "rawhide-$x" &
done

Make sure you've got lazy downloading enabled.

Added by semyers almost 9 years ago

Revision 58cfa008

Re-institute purging duplicate NEVRA after an rpm repo sync

https://pulp.plan.io/issues/1461 fixes #1461

The aggregation technique is better in all respects than mapreduce, save one: mapreduce works in mongo 2.4, and thus el6. If, by chance, we get to use mongo 2.6+ exclusively, the mapreduce method should be removed. For now, it is guarded by a mongo version check.
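A version guard of the kind described could look something like the sketch below. The function names and the returned strategy labels are hypothetical and not taken from the commit; the version string is assumed to be in the dotted form that pymongo's `server_info()["version"]` reports.

```python
def parse_major_minor(version_string):
    """Turn a version string such as '2.4.14' into a comparable (2, 4) tuple."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def pick_purge_strategy(version_string):
    """Prefer aggregation on mongo 2.6+, fall back to mapreduce on older servers."""
    if parse_major_minor(version_string) >= (2, 6):
        return "aggregation"
    return "mapreduce"
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "2.10" sorting before "2.6".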

Actions #10

Updated by semyers almost 9 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #11

Updated by dkliban@redhat.com almost 9 years ago

  • Status changed from MODIFIED to 5
Actions #12

Updated by mhrivnak almost 9 years ago

  • Has duplicate Issue #1744: Duplicate RPMs in pulp can't be installed added
Actions #13

Updated by dkliban@redhat.com almost 9 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #14

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added
