Issue #1461

Add the call to remove_unit_duplicate_nevra once optimised

Added by jcline@redhat.com over 5 years ago. Updated about 2 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
2.8.0
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

In pulp_rpm's sync, the call to remove_unit_duplicate_nevra was temporarily removed in https://github.com/pulp/pulp_rpm/pull/756 because of serious performance problems (simply adding ~10,000 RPM units to Mongo took ~3 hours). The call ran after each RPM was added, and each call issued a very slow MongoDB query. The call should be optimized and then restored.


Related issues

Has duplicate Pulp - Issue #1744: Duplicate RPMs in pulp can't be installed (CLOSED - DUPLICATE)

Associated revisions

Revision 58cfa008
Added by semyers over 5 years ago

Re-institute purging duplicate NEVRA after an rpm repo sync

https://pulp.plan.io/issues/1461 fixes #1461

The aggregation technique is better in all respects than mapreduce, save one: mapreduce works in mongo 2.4, and thus el6. If, by chance, we get to use mongo 2.6+ exclusively, the mapreduce method should be removed. For now, it is guarded by a mongo version check.
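
A version guard of the kind the commit message mentions might look roughly like the sketch below. The function name `supports_aggregation` and the fallback behaviour for unparseable versions are assumptions made for illustration, not taken from the commit; the version string is assumed to be something like what pymongo's `server_info()["version"]` reports.

```python
import re


def supports_aggregation(version_string, minimum=(2, 6)):
    """Return True if a MongoDB version string such as "2.4.14" is at least `minimum`.

    Hypothetical helper: only the major and minor numbers are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        # Unknown version format: assume the older (mapreduce) path is needed.
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

With a check like this, the caller would pick the aggregation method when it returns True and fall back to mapreduce otherwise.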

History

#1 Updated by jcline@redhat.com over 5 years ago

  • Description updated (diff)

#2 Updated by mhrivnak over 5 years ago

  • Triaged changed from No to Yes

#3 Updated by jcline@redhat.com over 5 years ago

  • Project changed from Pulp to RPM Support
  • Triaged changed from Yes to No

#4 Updated by jcline@redhat.com over 5 years ago

  • Triaged changed from No to Yes

#5 Updated by mhrivnak over 5 years ago

Searching units in a repo is a somewhat expensive operation, and doing it once for each unit caused n**2 performance degradation. The likely solution to this will be:

- during sync, keep an in-memory collection that tracks the id and nevra of each rpm that gets added
- at the end, do one search for units that match those nevra but do not match the unit IDs, and remove the results

An alternative to searching based on unit IDs is to simply search for unit associations that were created before the sync operation started. This would likely perform better.
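
The first approach above (track id and NEVRA in memory, then search once at the end) can be sketched in plain Python. This is only an illustration of the idea: `Unit`, `nevra`, and `find_stale_duplicates` are hypothetical names, and a list stands in for the repository search that the real code would run against MongoDB.

```python
from collections import namedtuple

# Hypothetical stand-in for an RPM unit; real Pulp units carry many more fields.
Unit = namedtuple("Unit", ["unit_id", "name", "epoch", "version", "release", "arch"])


def nevra(unit):
    """Return the NEVRA tuple (name, epoch, version, release, arch) of a unit."""
    return (unit.name, unit.epoch, unit.version, unit.release, unit.arch)


def find_stale_duplicates(repo_units, synced_units):
    """Return repo units that share a NEVRA with a synced unit but are not
    themselves one of the synced units."""
    synced_ids = set()
    synced_nevra = set()
    for unit in synced_units:  # in a real sync this would be filled as units are added
        synced_ids.add(unit.unit_id)
        synced_nevra.add(nevra(unit))
    # One pass over the repo at the end, instead of one search per added unit.
    return [
        unit for unit in repo_units
        if nevra(unit) in synced_nevra and unit.unit_id not in synced_ids
    ]
```

Because the set lookups are constant time on average, the work grows linearly with the number of units rather than quadratically.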

#6 Updated by semyers over 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to semyers

#7 Updated by semyers over 5 years ago

After getting pulp to actually sync all of rawhide (48,000ish packages, on_demand download policy for the win), I tried a few different approaches to efficiently chewing through large unit collections (units_rpm, in this case). The winning solution uses a mongo aggregation pipeline to "pre-filter" the potential duplicate packages of a given content unit type (limited at the moment to RPM, SRPM, and DRPM, like the previous implementation), and then cross-references those potential duplicates with a given repository to find (and purge) stale entries in the RepositoryContentUnit association collection.
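
For readers unfamiliar with the technique, the "pre-filter" step could be expressed as a pipeline along these lines. This is a minimal sketch, not the pipeline from the branch: the helper name `duplicate_nevra_pipeline` is hypothetical, and the field names are assumed to be the five NEVRA attributes stored on each unit document.

```python
# Assumed field names for the NEVRA attributes on a unit document.
NEVRA_FIELDS = ("name", "epoch", "version", "release", "arch")


def duplicate_nevra_pipeline(fields=NEVRA_FIELDS):
    """Build a MongoDB aggregation pipeline that yields one document per
    NEVRA shared by more than one unit, along with the ids of those units."""
    group_key = {field: "$" + field for field in fields}
    return [
        # Group units by NEVRA, collecting their ids and counting them.
        {"$group": {"_id": group_key,
                    "unit_ids": {"$push": "$_id"},
                    "count": {"$sum": 1}}},
        # Keep only the NEVRA groups that contain more than one unit.
        {"$match": {"count": {"$gt": 1}}},
    ]
```

The resulting list could be passed to pymongo's `Collection.aggregate` on the unit collection. Cross-referencing the returned unit ids against a repository's associations is a separate step not shown here.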

The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR, because it's a little bit nuts:
https://github.com/pulp/pulp_rpm/compare/master...seandst:1461-purge-duplicat-nevra

#8 Updated by semyers over 5 years ago

semyers wrote:

The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR...

...and then all the serializer backward-incompatible stuff flared up on platform. I'm back on this issue now. :)

#9 Updated by semyers over 5 years ago

  • Status changed from ASSIGNED to POST

POST!
https://github.com/pulp/pulp_rpm/pull/783

Here's one torture test that I used to load pulp up with duplicate nevra:

for x in `seq 4`
do
  pulp-admin rpm repo create --feed "https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-rawhide&arch=x86_64" --download-policy on_demand --repo-id rawhide-$x --relative-url rawhide-$x
  pulp-admin rpm repo sync run --repo-id rawhide-$x &
done

Make sure you've got lazy downloading enabled.

#10 Updated by semyers over 5 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100

#11 Updated by dkliban@redhat.com over 5 years ago

  • Status changed from MODIFIED to 5

#12 Updated by mhrivnak over 5 years ago

  • Has duplicate Issue #1744: Duplicate RPMs in pulp can't be installed added

#13 Updated by dkliban@redhat.com about 5 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE

#14 Updated by bmbouter about 2 years ago

  • Tags Pulp 2 added
