Issue #9542

Repositories with packages in common (same hash) redownload the file even if it is already in the artifact folder

Added by ajsween about 1 month ago. Updated 4 days ago.

Start date:
Due date:
Estimated time:
2. Medium
Platform Release:
Sprint Candidate:
Sprint 110


Given that the workers are using the remote repository metadata which contains package hashes to inform the sync process, pulp_rpm should check if the package already exists locally as an artifact. If it does, skip the artifact download and create a content object with a reference to the existing artifact that will be associated to the repository being sync'd. This saves on downloading the same file over and over during syncs.
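The requested behaviour can be sketched as a hash lookup that runs before any download. This is only an illustration under stated assumptions: `ArtifactStore`, `get_or_download`, and `sync` are hypothetical names, and an in-memory dict stands in for the artifact table. It is not how pulpcore or pulp_rpm implement syncing.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ArtifactStore:
    """Hypothetical in-memory stand-in for the artifact table, keyed by SHA256."""

    by_sha256: dict = field(default_factory=dict)
    downloads: int = 0

    def get_or_download(self, sha256, fetch):
        # Reuse the stored artifact when the hash from the remote metadata is known.
        if sha256 in self.by_sha256:
            return self.by_sha256[sha256]
        # Otherwise download, verify the digest, and save the new artifact.
        data = fetch()
        self.downloads += 1
        actual = hashlib.sha256(data).hexdigest()
        if actual != sha256:
            raise ValueError(f"digest mismatch: expected {sha256}, got {actual}")
        self.by_sha256[sha256] = data
        return data


def sync(store, repo_metadata, fetch_by_name):
    """Return {package name: sha256}, downloading only hashes the store lacks."""
    content = {}
    for name, sha256 in repo_metadata.items():
        store.get_or_download(sha256, lambda n=name: fetch_by_name(n))
        content[name] = sha256
    return content
```

With this shape, syncing a second repository whose metadata lists the same hashes creates content entries without triggering any further downloads.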

I validated this by setting up two remotes and two repositories. Both remotes point to the same remote repository (Appstream), and each local repository points to one of the remotes. After syncing the first repository using the "immediate" policy, I set up tcpdump to monitor GET requests:

tcpdump -i <interface_name> -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'

Next I began the sync of the second repository. The GET requests showed that the second sync was clearly redownloading every file.
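For readers unfamiliar with the capture filter used in the report: it reads the TCP header length from byte 12 of the segment, skips that many bytes, and compares the first four payload bytes with 0x47455420, which is ASCII for "GET ". The sketch below reproduces that test on a raw TCP segment; the function names are hypothetical and it only illustrates the filter logic.

```python
import struct

GET_MAGIC = 0x47455420  # the bytes b"GET " read as a big-endian 32-bit integer


def tcp_payload_offset(tcp_segment: bytes) -> int:
    """Return the TCP header length in bytes.

    The high nibble of byte 12 is the data offset in 32-bit words, so
    (byte & 0xF0) >> 4 gives words and (byte & 0xF0) >> 2 gives bytes.
    """
    return (tcp_segment[12] & 0xF0) >> 2


def is_http_get(tcp_segment: bytes) -> bool:
    """Check whether the TCP payload starts with 'GET ', as the filter does."""
    offset = tcp_payload_offset(tcp_segment)
    head = tcp_segment[offset:offset + 4]
    return len(head) == 4 and struct.unpack("!I", head)[0] == GET_MAGIC
```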

This makes little sense given that the artifacts are stored as files named after their SHA256 hash (minus the first two characters which are used as the folder name). In other words, there can only be one blob per unique file hash.
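The storage layout described above can be sketched as follows. The `artifact` root directory name and the function name are assumptions made for the example; only the split of the SHA256 digest into a two-character folder and a file name made of the remaining characters comes from the description.

```python
import hashlib
from pathlib import PurePosixPath


def artifact_relative_path(data: bytes, root: str = "artifact") -> PurePosixPath:
    """Derive a storage path from the SHA256 digest of the file contents."""
    digest = hashlib.sha256(data).hexdigest()
    # The first two hex characters name the folder; the remaining 62 name the file.
    return PurePosixPath(root) / digest[:2] / digest[2:]
```

Because the path depends only on the content hash, two identical packages from different remotes map to the same file, which is why a second download of the same bytes is redundant.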

Related issues

Related to Pulp - Backport #9584: Backport 9542 to pulpcore 3.15 (NEW)
Has duplicate Pulp - Issue #9552: Syncs are downloading every artifact, every time (CLOSED - DUPLICATE)


#1 Updated by ttereshc 23 days ago

  • Project changed from RPM Support to Pulp

Could you share the pulpcore and pulp_rpm versions you are using?
Could you confirm that the first repo sync finished successfully without any failures?

Your expectation is correct: once a package is in Pulp, it should not be redownloaded on a subsequent sync, whether into the same repo or a different one.

#2 Updated by ajsween 15 days ago

"versions": [ { "component": "core", "version": "3.16.0" }, { "component": "rpm", "version": "3.16.1" }, { "component": "python", "version": "3.5.2" }, { "component": "file", "version": "1.10.1" }, { "component": "deb", "version": "2.16.0" }, { "component": "container", "version": "2.9.0" }, { "component": "certguard", "version": "1.5.1" }, { "component": "ansible", "version": "0.10.1" } ],

I can confirm that the syncs were successful and that version 1 of each repository was created. My expectation is that once an artifact is created in Pulp, it is not downloaded again, regardless of whether other repositories, remotes, or content types have content units that reference the same sha256 (artifact) hash using different metadata.

Currently I am noticing that every time a sync runs between rpm remotes/repositories and the sync optimization determines that a sync is warranted, the task redownloads every package into tmp and then only updates those that changed. This is incredibly problematic and has made Pulp completely unusable as a daily sync job against remote mirrors.

If someone could spend some time helping me get up to speed on the sync process I'd be happy to help troubleshoot the code further. I have built three new pulp deployments to test how it behaves with a brand new database with the same result. Every sync that isn't skipped results in downloading every package. Syncing a new remote/repository aimed at an identical mirror (for instance two different mirrors of Centos8 Appstream) results in downloading everything twice with no consideration of existing artifacts.

I've also written bash scripts to add all artifacts in the artifacts folder of a preexisting Pulp deployment to a new Pulp deployment as artifact objects. I then added a remote/repository whose referenced artifacts are all among the ones I added manually. The sync still ignores everything that already exists as an artifact in the database, downloads the entire remote repository to the tmp folder, and only then determines which content units and artifacts need to be added.

Any assistance someone can offer me would be greatly appreciated as this has made syncs of RPM repositories unviable.

#3 Updated by 4 days ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 110

#4 Updated by dalley 4 days ago

  • Has duplicate Issue #9552: Syncs are downloading every artifact, every time added

#6 Updated by bmbouter 4 days ago

I believe we do have an issue here. I used this script below:


set -ev


pulp file remote create --name $remote_name --url $url
pulp file repository create --name $repo_name --remote $remote_name
pulp file repository sync --name $repo_name


pulp file remote create --name $remote_name --url $url
pulp file repository create --name $repo_name --remote $remote_name
pulp file repository sync --name $repo_name

And added this diff to pulpcore to make it easy to monitor the downloads occurring:

diff --git a/pulpcore/download/ b/pulpcore/download/
index 725a9c8d2..68a7a4149 100644
--- a/pulpcore/download/
+++ b/pulpcore/download/
@@ -156,6 +156,7 @@ class HttpDownloader(BaseDownloader):
             kwargs (dict): This accepts the parameters of
+        log.warning(f"Downloading url: {url}")
         if session:
             self.session = session
             self._close_session_on_finalize = False

And when I run the script I see:

pulp [2b50c00cf8c54f2983b7b7cdd066a522]: Downloading url:
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: Downloading url:
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: Downloading url:
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: Downloading url:
pulp [dcf3453f2444431ba47fb3d19524c5fd]: Downloading url:
pulp [dcf3453f2444431ba47fb3d19524c5fd]: Downloading url:
pulp [dcf3453f2444431ba47fb3d19524c5fd]: Downloading url:

I expected only to see:

pulp [2b50c00cf8c54f2983b7b7cdd066a522]: Downloading url:
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: Downloading url:
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: Downloading url:
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: Downloading url:
pulp [dcf3453f2444431ba47fb3d19524c5fd]: Downloading url:

What's interesting is that in running it over and over (with a full reset in between) I never see 1.iso redownloaded, but I always see 2.iso and 3.iso.
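To compare runs like the ones above without reading the log by eye, the "Downloading url" lines can be tallied per bracketed ID. This is a small sketch that assumes the log format shown in this comment and assumes that each bracketed ID corresponds to one sync; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "pulp [2b50c0...]: Downloading url: ..."
LOG_LINE = re.compile(r"^pulp \[([0-9a-f]+)\]: Downloading url:")


def downloads_per_run(log_text: str) -> Counter:
    """Count 'Downloading url' lines for each bracketed ID in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

Applied to the observed output, the first ID would be counted four times and the second three times, whereas the expected output has the second ID only once.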

#7 Updated by bmbouter 4 days ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter

#8 Updated by pulpbot 4 days ago

  • Status changed from ASSIGNED to POST

#9 Updated by mdellweg 1 day ago
