Issue #1461: Add the call to remove_unit_duplicate_nevra once optimised - RPM Support - Pulp

Added by jcline@redhat.com about 9 years ago. Updated almost 6 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee: semyers
Severity: 2. Medium
Platform Release: 2.8.0
Triaged: Yes
Tags: Pulp 2

Description

In pulp_rpm's sync, the call to remove_unit_duplicate_nevra was temporarily removed in https://github.com/pulp/pulp_rpm/pull/756 because of serious performance problems (simply adding ~10,000 RPM units to Mongo took ~3 hours). The call ran after each RPM was added, and each call triggered a very slow MongoDB query. The call should be restored once it has been optimized.

Updated by jcline@redhat.com about 9 years ago

Description updated (diff)

Updated by mhrivnak about 9 years ago

Triaged changed from No to Yes

Updated by jcline@redhat.com about 9 years ago

Project changed from Pulp to RPM Support
Triaged changed from Yes to No

Updated by jcline@redhat.com about 9 years ago

Triaged changed from No to Yes

Updated by mhrivnak about 9 years ago

Searching units in a repo is a somewhat expensive operation, and doing it once for each unit caused quadratic (n**2) performance degradation. The likely solution to this will be:

- during sync, keep an in-memory collection that tracks the id and nevra of each rpm that gets added
- at the end, do one search for units that match those nevra but do not match the unit IDs, and remove the results

An alternative to searching based on unit IDs is to simply search for unit associations that were created before the sync operation started. This would likely perform better.
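
The first approach above can be sketched in a few lines of Python. This is only an illustration of the idea, not Pulp code: the SyncedUnitTracker class, its method names, and the (unit_id, nevra) pair format are all hypothetical, and the repo_units argument stands in for the result of the single end-of-sync search.

```python
from collections import namedtuple

# NEVRA = name, epoch, version, release, arch.
Nevra = namedtuple("Nevra", ["name", "epoch", "version", "release", "arch"])


class SyncedUnitTracker:
    """Hypothetical helper: remembers which unit IDs were added per NEVRA."""

    def __init__(self):
        self._ids_by_nevra = {}

    def record(self, unit_id, nevra):
        """Call once for each RPM added during the sync (no database query)."""
        self._ids_by_nevra.setdefault(nevra, set()).add(unit_id)

    def find_stale(self, repo_units):
        """Return IDs of repo units whose NEVRA was synced under another ID.

        repo_units stands in for the single end-of-sync search: an iterable
        of (unit_id, nevra) pairs for the units associated with the repo.
        """
        return [
            unit_id
            for unit_id, nevra in repo_units
            if nevra in self._ids_by_nevra
            and unit_id not in self._ids_by_nevra[nevra]
        ]
```

The per-unit cost during sync is a dictionary update, and the expensive repository search happens once at the end instead of once per unit.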

Updated by semyers about 9 years ago

Status changed from NEW to ASSIGNED
Assignee set to semyers

Updated by semyers about 9 years ago

After getting Pulp to actually sync all of rawhide (roughly 48,000 packages, on_demand download policy for the win), I tried a few different approaches to efficiently chewing through large unit collections (units_rpm, in this case). The winning solution uses a Mongo aggregation pipeline to "pre-filter" the potential duplicate packages of a given content unit type (limited at the moment to RPM, SRPM, and DRPM, like the previous implementation). It then cross-references those potential duplicates with a given repository to find (and purge) stale entries in the RepositoryContentUnit association collection.

The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR, because it's a little bit nuts:
https://github.com/pulp/pulp_rpm/compare/master...seandst:1461-purge-duplicat-nevra
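
For readers unfamiliar with the technique, here is a rough Python sketch of the two steps described above. It is not the code from the branch. The field names, the function names, the shape of repo_associations, and the rule of keeping the most recently associated unit are all assumptions made for illustration. The pipeline uses only the standard $group, $addToSet, $sum, $match, and $gt operators and could be passed to pymongo's Collection.aggregate; no database is needed to run the sketch itself.

```python
# Assumed NEVRA field names on the unit documents.
NEVRA_FIELDS = ("name", "epoch", "version", "release", "arch")


def duplicate_nevra_pipeline(fields=NEVRA_FIELDS):
    """Build a pipeline that groups units by NEVRA and keeps groups of 2+."""
    group_key = {field: "$" + field for field in fields}
    return [
        {"$group": {
            "_id": group_key,
            "unit_ids": {"$addToSet": "$_id"},
            "count": {"$sum": 1},
        }},
        {"$match": {"count": {"$gt": 1}}},
    ]


def stale_in_repo(duplicate_groups, repo_associations):
    """Cross-reference duplicate groups with one repo's associations.

    duplicate_groups: documents shaped like the pipeline's output.
    repo_associations: hypothetical dict of unit_id -> association timestamp.
    Assumption: the most recently associated unit in each group is kept.
    """
    stale = []
    for group in duplicate_groups:
        in_repo = [u for u in group["unit_ids"] if u in repo_associations]
        if len(in_repo) < 2:
            continue  # no duplicate NEVRA inside this particular repo
        newest = max(in_repo, key=lambda u: repo_associations[u])
        stale.extend(u for u in in_repo if u != newest)
    return stale
```

Grouping first keeps the cross-reference step small: only units that share a NEVRA with at least one other unit are ever compared against the repository's associations.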

Updated by semyers about 9 years ago

semyers wrote:

The branch is pushed here, but I'm going to let it sit for a day and come back to it before putting up a PR...

...and then all the serializer backward-incompatible stuff flared up on platform. I'm back on this issue now. :)

Updated by semyers about 9 years ago

Status changed from ASSIGNED to POST

POST!
https://github.com/pulp/pulp_rpm/pull/783

Here's one torture test that I used to load pulp up with duplicate nevra:

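for x in $(seq 4)
do
  # Create a repo fed from the rawhide mirrorlist, using the on_demand download policy.
  pulp-admin rpm repo create --feed "https://mirrors.fedoraproject.org/mirrorlist?repo=fedora-rawhide&arch=x86_64" --download-policy on_demand --repo-id "rawhide-$x" --relative-url "rawhide-$x"
  # Start the sync in the background so the loop moves on to the next repo immediately.
  pulp-admin rpm repo sync run --repo-id "rawhide-$x" &
done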

Make sure you've got lazy downloading enabled.

Added by semyers almost 9 years ago

Revision 58cfa008

Re-institute purging duplicate NEVRA after an rpm repo sync

https://pulp.plan.io/issues/1461 fixes #1461

The aggregation technique is better in all respects than mapreduce, save one: mapreduce works in mongo 2.4, and thus el6. If, by chance, we get to use mongo 2.6+ exclusively, the mapreduce method should be removed. For now, it is guarded by a mongo version check.
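
A version guard of the kind described could look roughly like the following. This is a sketch, not the committed code: both function names are hypothetical, and it assumes the server version arrives as a dotted string such as the "version" value returned by pymongo's MongoClient.server_info().

```python
def mongo_major_minor(version_string):
    """Parse '2.6.11' into (2, 6); only the first two parts are compared."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def choose_purge_strategy(version_string):
    """Pick aggregation on mongo 2.6+, otherwise fall back to mapreduce."""
    if mongo_major_minor(version_string) >= (2, 6):
        return "aggregation"
    return "mapreduce"
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "2.10" sorting before "2.6".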
