Issue #9542

closed

Repositories with packages in common (same hash) redownload the file even if it is already in the artifact folder

Added by ajsween over 2 years ago. Updated over 2 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags:
Sprint: Sprint 110
Quarter:

Description

Given that the workers use the remote repository metadata, which contains package hashes, to inform the sync process, pulp_rpm should check whether the package already exists locally as an artifact. If it does, it should skip the artifact download and create a content object referencing the existing artifact, which is then associated with the repository being synced. This avoids downloading the same file over and over across syncs.
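
A minimal, self-contained sketch of the behavior being asked for, in plain Python rather than pulpcore code; the dict stands in for the Artifact table, fetch() stands in for an HTTP download, and the names and URL are made up for illustration:

import hashlib

artifact_store = {}  # sha256 hex digest -> stored file bytes; stands in for the Artifact table

def fetch(url):
    # Hypothetical downloader; a real sync would issue an HTTP GET here.
    print(f"GET {url}")
    return f"contents of {url}".encode()

def ensure_artifact(expected_sha256, url):
    """Return the bytes for expected_sha256, downloading only when not already stored."""
    if expected_sha256 in artifact_store:
        return artifact_store[expected_sha256]  # reuse the existing artifact: no network request
    data = fetch(url)                           # not stored yet: download exactly once
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError("checksum mismatch")
    artifact_store[digest] = data
    return data

# Two repositories referencing the same package hash should trigger only one GET.
pkg_url = "https://mirror.example/Packages/foo.rpm"
pkg_sha = hashlib.sha256(f"contents of {pkg_url}".encode()).hexdigest()
ensure_artifact(pkg_sha, pkg_url)  # first sync: prints "GET ..."
ensure_artifact(pkg_sha, pkg_url)  # second sync: served from the local store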

I validated this by setting up two remotes and two repositories. Both remotes point to the same remote repository (Appstream), and the local repositories point to each of the remotes respectively. After I finished syncing one of the repositories using the "immediate" policy, I set up tcpdump to monitor GET requests:

tcpdump -i <interface_name> -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'

Then I began the sync of the second repository and watched as the GET requests showed it was clearly redownloading every file.

This makes little sense given that the artifacts are stored as files named after their SHA256 hash (minus the first two characters which are used as the folder name). In other words, there can only be one blob per unique file hash.
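
For illustration, this is the kind of content-addressed layout being described; the "artifact" root directory below is just a placeholder, not a claim about Pulp's exact MEDIA_ROOT structure:

import hashlib
from pathlib import Path

def artifact_path(data, root=Path("artifact")):
    digest = hashlib.sha256(data).hexdigest()
    return root / digest[:2] / digest[2:]  # first two hex chars = folder, remainder = file name

print(artifact_path(b"same bytes"))  # identical bytes always map to the same single path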


Related issues

Related to Pulp - Backport #9584: Backport 9542 to pulpcore 3.15 (CLOSED - CURRENTRELEASE, bmbouter)
Related to Pulp - Backport #9596: backport Issue #9542 to pulpcore 3.15 (CLOSED - CURRENTRELEASE, bmbouter)
Has duplicate Pulp - Issue #9552: Syncs are downloading every artifact, every time (CLOSED - DUPLICATE)
Actions #1

Updated by ttereshc over 2 years ago

  • Project changed from RPM Support to Pulp

Could you share the pulpcore and pulp_rpm versions you are using?
Could you confirm that the first repo sync finished successfully, without any failures?

Your expectation is correct: once a package is in Pulp, it should not be redownloaded on a subsequent sync, same repo or not.

Actions #2

Updated by ajsween over 2 years ago

"versions": [ { "component": "core", "version": "3.16.0" }, { "component": "rpm", "version": "3.16.1" }, { "component": "python", "version": "3.5.2" }, { "component": "file", "version": "1.10.1" }, { "component": "deb", "version": "2.16.0" }, { "component": "container", "version": "2.9.0" }, { "component": "certguard", "version": "1.5.1" }, { "component": "ansible", "version": "0.10.1" } ],

I can confirm the syncs were successful and that a version 1 of the repositories was created. My expectation is that once an artifact is created in Pulp, it is not downloaded again, regardless of whether other repositories/remotes/content types have content units that reference the same sha256 (artifact) hash using different metadata.

Currently I am noticing that every time a sync is done between RPM remotes/repositories and the sync optimization determines a sync is warranted, the tasking redownloads every package into tmp and then updates only those that changed. This is incredibly problematic and has made Pulp completely unusable for a daily sync job against remote mirrors.

If someone could spend some time helping me get up to speed on the sync process, I'd be happy to help troubleshoot the code further. I have built three new Pulp deployments to test how it behaves with a brand-new database, with the same result: every sync that isn't skipped results in downloading every package. Syncing a new remote/repository aimed at an identical mirror (for instance, two different mirrors of CentOS 8 Appstream) results in downloading everything twice, with no consideration of existing artifacts.

I've also written bash scripts to add all artifacts in the artifacts folder of a preexisting Pulp deployment to a new Pulp deployment as artifact objects, and then added a remote/repository whose referenced artifacts are all among the artifacts I added manually. This still results in the sync ignoring anything that already exists as an artifact in the database, downloading the entire remote repository to the tmp folder, and only then determining which content units and artifacts need to be added.
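
For reference, a rough sketch of what such a pre-seeding script could look like against the /pulp/api/v3/artifacts/ endpoint; the base URL, credentials, and source directory below are assumptions for illustration, not the reporter's actual scripts:

from pathlib import Path
import requests

PULP_BASE = "https://pulp.example.com"               # assumed Pulp API host
AUTH = ("admin", "password")                          # assumed credentials
ARTIFACT_DIR = Path("/var/lib/pulp/media/artifact")   # assumed location of the existing artifact tree

for path in ARTIFACT_DIR.glob("*/*"):                 # layout: <two-char dir>/<rest of sha256>
    sha256 = path.parent.name + path.name             # reconstruct the digest from the path
    with path.open("rb") as fh:
        response = requests.post(
            f"{PULP_BASE}/pulp/api/v3/artifacts/",
            files={"file": fh},
            data={"sha256": sha256},
            auth=AUTH,
        )
    response.raise_for_status()                       # an already-known sha256 may be rejected; handling omitted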

Any assistance someone can offer me would be greatly appreciated as this has made syncs of RPM repositories unviable.

Actions #3

Updated by dkliban@redhat.com over 2 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 110
Actions #4

Updated by dalley over 2 years ago

  • Has duplicate Issue #9552: Syncs are downloading every artifact, every time added
Actions #6

Updated by bmbouter over 2 years ago

I believe we do have an issue here. I used the script below:

#!/bin/bash

set -ev

repo_name="repo$RANDOM"
remote_name="remote$RANDOM"
url="https://fixtures.pulpproject.org/file/PULP_MANIFEST"

pulp file remote create --name $remote_name --url $url
pulp file repository create --name $repo_name --remote $remote_name
pulp file repository sync --name $repo_name

repo_name="repo$RANDOM"
remote_name="remote$RANDOM"

pulp file remote create --name $remote_name --url $url
pulp file repository create --name $repo_name --remote $remote_name
pulp file repository sync --name $repo_name

And added this diff to pulpcore to make it easy to monitor the downloads occurring:

diff --git a/pulpcore/download/http.py b/pulpcore/download/http.py
index 725a9c8d2..68a7a4149 100644
--- a/pulpcore/download/http.py
+++ b/pulpcore/download/http.py
@@ -156,6 +156,7 @@ class HttpDownloader(BaseDownloader):
             kwargs (dict): This accepts the parameters of
                 :class:`~pulpcore.plugin.download.BaseDownloader`.
         """
+        log.warning(f"Downloading url: {url}")
         if session:
             self.session = session
             self._close_session_on_finalize = False

And when I run the script I see:

pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/PULP_MANIFEST
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/1.iso
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/2.iso
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/3.iso
pulp [dcf3453f2444431ba47fb3d19524c5fd]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/PULP_MANIFEST
pulp [dcf3453f2444431ba47fb3d19524c5fd]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/2.iso
pulp [dcf3453f2444431ba47fb3d19524c5fd]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/3.iso

I expected only to see:

pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/PULP_MANIFEST
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/1.iso
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/2.iso
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/3.iso
pulp [dcf3453f2444431ba47fb3d19524c5fd]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/PULP_MANIFEST

What's interesting is that when running it over and over (with a full reset in between), I never see 1.iso redownloaded, but I always see 2.iso and 3.iso.

Actions #7

Updated by bmbouter over 2 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
Actions #8

Updated by pulpbot over 2 years ago

  • Status changed from ASSIGNED to POST
Actions #9

Updated by mdellweg over 2 years ago

Added by mdellweg over 2 years ago

Revision a9431560 | View on GitHub

Fixes Artifact redownloading bug

The sync_to_async_iterable wraps the Artifact queryset, but unlike a queryset, the resulting iterable cannot be reused. As a result, subsequent iterations over it yield nothing.

closes #9542
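
A self-contained illustration of the failure mode described here, using plain asyncio rather than the pulpcore code itself: an exhausted async generator silently yields nothing on a second pass, whereas a Django queryset re-evaluates each time it is iterated.

import asyncio

async def to_async_iterable(items):
    # Stand-in for wrapping a queryset as a one-shot async iterable.
    for item in items:
        yield item

async def main():
    existing_artifacts = to_async_iterable(["sha256:aaa...", "sha256:bbb..."])
    first_pass = [a async for a in existing_artifacts]
    second_pass = [a async for a in existing_artifacts]  # generator already exhausted
    print(first_pass)   # ['sha256:aaa...', 'sha256:bbb...']
    print(second_pass)  # [] -> later lookups against "existing" artifacts find nothing

asyncio.run(main())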

Actions #10

Updated by mdellweg over 2 years ago

  • Status changed from POST to MODIFIED
Actions #11

Updated by jsherril@redhat.com over 2 years ago

  • Related to Backport #9596: backport Issue #9542 to pulpcore 3.15 added
Actions #12

Updated by pulpbot over 2 years ago

  • Sprint/Milestone set to 3.17.0
Actions #13

Updated by pulpbot over 2 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
