Issue #2457
closed
When syncing do not associate units that are already associated to the repo
Description
I synced an el6 repo, where the first sync took 1h 15 mins:
To download metadata 1 min
To generate db file 4 mins
To determine what to download 4 mins
To actually download the content 66 mins
To download additional units 2 mins
I re-synced the same repo after I removed a couple of RPMs:
To download metadata 1 min
To generate db file 4 mins
To determine what to download 7 mins
To actually download the content 4 mins (the ones I removed)
To download additional units 2 mins
After some investigation it was clear that, on re-sync, the step "determine what to download" takes the most time: 7 mins.
Half of this time is spent on metadata file handling; we cannot do anything about that.
The other half is spent here, where we check whether the unit that we want is already present on the filesystem:
https://github.com/pulp/pulp_rpm/blob/2.8-dev/plugins/pulp_rpm/plugins/importers/yum/existing.py#L92
We could optimize this part of the code and, at the very least, neither re-associate units that are already associated to the repo nor add them to the catalog, because they are already there.
Another place where we could make the same improvements is the step "download additional units" (errata, comps, yumrepometadata files).
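To illustrate the idea, here is a minimal sketch and not the actual pulp_rpm code. The names `UnitKey` and `plan_sync` are hypothetical, and the sketch assumes that a unit already associated to the repo is also already stored locally and already in the catalog. It only shows how units could be split so that association and catalog work is skipped for units the repo already has:

```python
from collections import namedtuple

# Hypothetical stand-in for an RPM unit key; not the real pulp_rpm model.
UnitKey = namedtuple("UnitKey", ["name", "epoch", "version", "release", "arch"])


def plan_sync(wanted, on_disk, associated):
    """Split the wanted units into three groups.

    wanted:     iterable of UnitKey listed in the remote metadata
    on_disk:    set of UnitKey already stored locally
    associated: set of UnitKey already associated to this repo
                (assumed here to be a subset of on_disk)

    Returns (to_download, to_associate, skipped).
    """
    to_download, to_associate, skipped = [], [], []
    for key in wanted:
        if key in associated:
            # Already in the repo: no association, no catalog entry.
            skipped.append(key)
        elif key in on_disk:
            # Content exists locally but is not yet in this repo.
            to_associate.append(key)
        else:
            # Content is missing and must be downloaded.
            to_download.append(key)
    return to_download, to_associate, skipped
```

With sets for `on_disk` and `associated`, each membership check is a constant-time lookup on average, so on a re-sync where almost everything is already associated, most units fall into the skipped group without any further work.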
Related issues
Reduce number of writes to db during sync
This commit eliminates the following unnecessary operations:
save() to errata model even when no new collections were added
closes #2457 https://pulp.plan.io/issues/2457
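The kind of change described in the commit message can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `Erratum` is a hypothetical in-memory stand-in for the errata model, and `merge_collections` is a hypothetical helper. The point is only that `save()` is called when at least one new collection was actually added:

```python
class Erratum:
    """Hypothetical in-memory stand-in for an errata model."""

    def __init__(self, pkglist=None):
        self.pkglist = list(pkglist or [])
        self.save_calls = 0

    def save(self):
        # A real model would write to the database; here we only count calls.
        self.save_calls += 1


def merge_collections(existing, incoming_collections):
    """Add unseen collections to `existing` and save only if something changed.

    Returns True if a write happened, False otherwise.
    """
    new = [c for c in incoming_collections if c not in existing.pkglist]
    if not new:
        # Nothing new: skip the database write entirely.
        return False
    existing.pkglist.extend(new)
    existing.save()
    return True
```

On a re-sync where the errata are unchanged, this guard turns one write per erratum into no writes at all.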