Refactor #5701
closedPerformance improvement in remote duplicates
100%
Description
The current implementation of the "Remove Duplicates" functionality is probably lacking in efficiency. It looks like this:
query_for_repo_duplicates_by_type = defaultdict(lambda: Q())
for item in repository_version.added():
detail_item = item.cast()
if detail_item.repo_key_fields == ():
continue
unit_q_dict = {
field: getattr(detail_item, field) for field in detail_item.repo_key_fields
}
item_query = Q(**unit_q_dict) & ~Q(pk=detail_item.pk)
query_for_repo_duplicates_by_type[detail_item._meta.model] |= item_query
for model in query_for_repo_duplicates_by_type:
_logger.debug(_("Removing duplicates for type: {}".format(model)))
qs = model.objects.filter(query_for_repo_duplicates_by_type[model])
repository_version.remove_content(qs)
While I haven't measured the exact impact, the individual item.cast() for each item is probably quite expensive. What would likely improve the situation is one of the following:
Proposal 1:
- Sort these into groups based on their pulp_type which is present on the master Content model.
- Look up the detail content models that represent the pulp_type strings
- Query the detail content models directly, in bulk, provided a list of PKs, instead of cast() individually
- Then within each type group check for duplicates
Proposal 2:
Alternately, each repository can list all of the content types it supports, which would allow us to skip item 2 above (maybe item 1 also) and would allow us to provide an extra layer of protection around making sure you can't have e.g. file content in an RPM repository which we can't easily or centrally guarantee otherwise.
Related issues
Updated by dalley about 5 years ago
- Tracker changed from Refactor to Issue
- Severity set to 2. Medium
- Triaged set to No
Updated by fao89 about 5 years ago
- Tracker changed from Issue to Refactor
- % Done set to 0
Updated by fao89 about 5 years ago
- Related to Issue #5688: Large memory consumption when syncing RHEL 7 os x86_64 added
Updated by dalley about 5 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dalley
Updated by dalley about 5 years ago
- Status changed from ASSIGNED to POST
- Sprint set to Sprint 62
https://github.com/pulp/pulpcore/pull/441
https://github.com/pulp/pulp_file/pull/329
https://github.com/pulp/pulp_rpm/pull/1549
https://github.com/pulp/pulp_python/pull/264
https://github.com/pulp/pulp_maven/pull/27
https://github.com/pulp/pulp_container/pull/25
https://github.com/pulp/pulp_ansible/pull/264
Added by dalley about 5 years ago
Added by dalley about 5 years ago
Revision bfc50d61 | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
Added by dalley about 5 years ago
Revision 8acbb8a0 | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
Added by dalley about 5 years ago
Revision 3caded8c | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
Added by dalley about 5 years ago
Revision 3caded8c | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
Added by dalley about 5 years ago
Revision 1af38824 | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
Added by dalley about 5 years ago
Revision 140d8495 | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
Added by dalley about 5 years ago
Revision 0604a7b4 | View on GitHub
Add additional content type verification to RepositoryVersion
Added by dalley about 5 years ago
Revision 1861257b | View on GitHub
Improve performance of removing duplicates
Updated by dalley about 5 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulpcore|1861257bee238b91488101c1ae257484fa48ab87.
Added by dalley about 5 years ago
Revision f4f190c2 | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
re: #5701 https://pulp.plan.io/issues/5701 (cherry picked from commit 140d849544fa2467ba30db16e6b5163274a4b3d0)
Added by dalley about 5 years ago
Revision cc2fb46a | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
re: #5701 https://pulp.plan.io/issues/5701 (cherry picked from commit 3caded8c2e6a5d3be5be218713741e7e4ae515ed)
Added by dalley about 5 years ago
Revision cc2fb46a | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
re: #5701 https://pulp.plan.io/issues/5701 (cherry picked from commit 3caded8c2e6a5d3be5be218713741e7e4ae515ed)
Added by dalley about 5 years ago
Revision 6b2af564 | View on GitHub
Add additional content type verification to RepositoryVersion
re: #5701 https://pulp.plan.io/issues/5701 (cherry picked from commit 0604a7b41cc5c394b9769d13f31fd0744d6d0ca7)
Added by dalley about 5 years ago
Revision 66dc8749 | View on GitHub
Improve performance of removing duplicates
closes: #5701 https://pulp.plan.io/issues/5701 (cherry picked from commit 1861257bee238b91488101c1ae257484fa48ab87)
Added by dalley about 5 years ago
Revision 32c1d08f | View on GitHub
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
re: #5701 https://pulp.plan.io/issues/5701 (cherry picked from commit fb3cda3c7314d0d7c5bb3e0c1e440808959be191)
Updated by bmbouter about 5 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Add CONTENT_TYPES to repo definition
Required PR: https://github.com/pulp/pulpcore/pull/441
re: #5701 https://pulp.plan.io/issues/5701