Refactor #5701

closed

Performance improvement in remove duplicates

Added by dalley over 4 years ago. Updated over 4 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone:
Start date:
Due date:
% Done: 100%
Estimated time:
Platform Release:
Groomed: No
Sprint Candidate: No
Tags:
Sprint: Sprint 63
Quarter:

Description

The current implementation of the "Remove Duplicates" functionality is probably inefficient. It looks like this:

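from collections import defaultdict

from django.db.models import Q

# One combined Q filter per detail model, OR-ed together item by item.
query_for_repo_duplicates_by_type = defaultdict(lambda: Q())
for item in repository_version.added():
    # cast() resolves the master Content row to its detail model, once per item.
    detail_item = item.cast()

    # Types without repo_key_fields cannot have repository-level duplicates.
    if detail_item.repo_key_fields == ():
        continue
    unit_q_dict = {
        field: getattr(detail_item, field) for field in detail_item.repo_key_fields
    }
    # Match content with the same key fields, excluding the item itself.
    item_query = Q(**unit_q_dict) & ~Q(pk=detail_item.pk)
    query_for_repo_duplicates_by_type[detail_item._meta.model] |= item_query

for model in query_for_repo_duplicates_by_type:
    # Translate the message template first, then interpolate the model.
    _logger.debug(_("Removing duplicates for type: {}").format(model))
    qs = model.objects.filter(query_for_repo_duplicates_by_type[model])
    repository_version.remove_content(qs)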

I haven't measured the exact impact, but calling item.cast() individually for each item is probably quite expensive, since each call resolves the detail model for a single item. One of the following would likely improve the situation:

Proposal 1:

  1. Sort the items into groups based on their pulp_type, which is present on the master Content model.
  2. Look up the detail content models that the pulp_type strings represent.
  3. Query the detail content models directly, in bulk, given a list of PKs, instead of calling cast() on each item individually.
  4. Within each type group, check for duplicates.
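
The four steps above could be sketched roughly as follows. This is a minimal illustration that does not use the ORM and is not Pulp code: `DETAIL_ROWS`, `REPO_KEY_FIELDS`, `fetch_details_in_bulk`, and `duplicates_to_remove` are hypothetical stand-ins. Each call to `fetch_details_in_bulk` stands in for a single bulk query per type (something like a `pk__in` filter on the detail model), which is what would replace the per-item cast().

```python
from collections import defaultdict

# Hypothetical stand-in for the detail tables: pulp_type -> {pk: detail row}.
DETAIL_ROWS = {
    "file.file": {
        1: {"relative_path": "a.iso", "digest": "aaa"},
        2: {"relative_path": "a.iso", "digest": "bbb"},
        3: {"relative_path": "b.iso", "digest": "ccc"},
    },
    "rpm.package": {
        4: {"name": "bash", "version": "5.0"},
    },
}

# Hypothetical per-type repo_key_fields (an empty tuple means "no dedup").
REPO_KEY_FIELDS = {
    "file.file": ("relative_path",),
    "rpm.package": ("name", "version"),
}


def fetch_details_in_bulk(pulp_type, pks):
    """Stand-in for ONE bulk query per type instead of one cast() per item."""
    table = DETAIL_ROWS[pulp_type]
    return {pk: table[pk] for pk in pks}


def group_by_type(rows):
    """Step 1: group (pk, pulp_type) pairs from the master table by type."""
    grouped = defaultdict(list)
    for pk, pulp_type in rows:
        grouped[pulp_type].append(pk)
    return grouped


def duplicates_to_remove(added, existing):
    """Return PKs of existing content displaced by newly added content."""
    added_by_type = group_by_type(added)
    existing_by_type = group_by_type(existing)

    to_remove = set()
    for pulp_type, added_pks in added_by_type.items():
        # Step 2: resolve the type string to its detail "model" metadata.
        key_fields = REPO_KEY_FIELDS.get(pulp_type, ())
        if not key_fields:
            continue
        # Step 3: fetch all detail rows of this type in bulk.
        added_rows = fetch_details_in_bulk(pulp_type, added_pks)
        existing_rows = fetch_details_in_bulk(pulp_type, existing_by_type[pulp_type])
        # Step 4: compare key tuples within the type group only.
        added_keys = {tuple(row[f] for f in key_fields) for row in added_rows.values()}
        for pk, row in existing_rows.items():
            if tuple(row[f] for f in key_fields) in added_keys:
                to_remove.add(pk)
    return to_remove
```

With content 2 newly added and content 1 and 3 already present, only content 1 shares a relative_path with the added item, so `duplicates_to_remove([(2, "file.file")], [(1, "file.file"), (3, "file.file")])` yields `{1}`. The number of lookups grows with the number of content types rather than the number of items.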

Proposal 2:

Alternatively, each repository could list all of the content types it supports. That would allow us to skip item 2 above (and maybe item 1 as well). It would also add an extra layer of protection by making sure you can't have, e.g., file content in an RPM repository, which we can't easily or centrally guarantee otherwise.
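
One way Proposal 2 might look, as a sketch only: the `CONTENT_TYPES` attribute, the `unsupported_types` and `validate_content` helpers, and the type strings are all hypothetical names chosen for illustration, not an existing Pulp API.

```python
class Repository:
    # Hypothetical: the pulp_type strings this repository type accepts.
    CONTENT_TYPES = ()

    @classmethod
    def unsupported_types(cls, pulp_types):
        """Return the sorted pulp_type strings this repository does not support."""
        return sorted(set(pulp_types) - set(cls.CONTENT_TYPES))


class RpmRepository(Repository):
    # Illustrative values only.
    CONTENT_TYPES = ("rpm.package", "rpm.advisory")


def validate_content(repo_cls, pulp_types):
    """Reject content whose type the repository has not declared."""
    bad = repo_cls.unsupported_types(pulp_types)
    if bad:
        raise ValueError(
            "{} does not support content types: {}".format(
                repo_cls.__name__, ", ".join(bad)
            )
        )
```

Because the repository already declares its types, the duplicate check could iterate over `CONTENT_TYPES` directly instead of first discovering which types are present, and the same declaration gives a central place to reject, say, file content added to an RPM repository.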


Related issues

Related to RPM Support - Issue #5688: Large memory consumption when syncing RHEL 7 os x86_64 (CLOSED - CURRENTRELEASE, fao89)
