Issue #9542
closed
Repositories with packages in common (same hash) redownload the file even if it is already in the artifact folder
Description
Given that the workers are using the remote repository metadata which contains package hashes to inform the sync process, pulp_rpm should check if the package already exists locally as an artifact. If it does, skip the artifact download and create a content object with a reference to the existing artifact that will be associated to the repository being sync'd. This saves on downloading the same file over and over during syncs.
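The expected behaviour can be sketched roughly as follows. This is only an illustration of the idea and not pulp_rpm or pulpcore code: `RemotePackage`, `sync_packages`, the `existing_artifacts` dictionary and the `download` callable are all hypothetical names made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class RemotePackage:
    # Hypothetical stand-in for one package entry in the remote metadata.
    url: str
    sha256: str


def sync_packages(
    packages: Iterable[RemotePackage],
    existing_artifacts: Dict[str, str],
    download: Callable[[str], str],
) -> Dict[str, str]:
    """Map each package hash to a local path, downloading only unknown hashes."""
    resolved = {}
    for pkg in packages:
        path = existing_artifacts.get(pkg.sha256)
        if path is None:
            # No artifact with this hash yet: download and remember it.
            path = download(pkg.url)
            existing_artifacts[pkg.sha256] = path
        # Known hash: reuse the existing artifact, no network request.
        resolved[pkg.sha256] = path
    return resolved
```

With a lookup like this, a second repository pointing at the same upstream would trigger no package downloads at all, only new references to artifacts that already exist.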
I validated this by setting up two remotes and two repositories. Both remotes point to the same remote repository (Appstream) and the local repositories point to each of the remotes respectively. After I finished syncing one of the repositories using the "immediate" policy, I set up tcpdump to monitor GET requests:
tcpdump -i <interface_name> -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'
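For anyone unfamiliar with this kind of capture filter: the intent is to read the TCP header length from the data-offset field in byte 12 and then compare the first four payload bytes with 0x47455420, which is ASCII for "GET " (with the trailing space). A small sketch of the same arithmetic, where `payload_starts_with_get` is a hypothetical helper and the input is assumed to be a raw TCP segment starting at the TCP header:

```python
GET_MAGIC = bytes.fromhex("47455420")  # b"GET "


def payload_starts_with_get(tcp_segment: bytes) -> bool:
    # The upper 4 bits of byte 12 hold the header length in 32-bit words.
    # (byte & 0xf0) >> 2 is the same as (byte >> 4) * 4, i.e. length in bytes.
    header_len = (tcp_segment[12] & 0xF0) >> 2
    # The payload begins right after the header; compare its first 4 bytes.
    return tcp_segment[header_len:header_len + 4] == GET_MAGIC
```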
Next I began the sync of the second repository. I then watched as the GET requests showed the second sync was clearly redownloading every file.
This makes little sense given that the artifacts are stored as files named after their SHA256 hash (minus the first two characters which are used as the folder name). In other words, there can only be one blob per unique file hash.
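To illustrate the layout described above, here is a minimal sketch that derives the storage path from the file contents. The root folder name `artifact` and the helper `artifact_relative_path` are assumptions for the example, not taken from the pulpcore source.

```python
import hashlib
from pathlib import PurePosixPath


def artifact_relative_path(data: bytes, root: str = "artifact") -> PurePosixPath:
    """Build <root>/<first two hex chars>/<remaining hex chars> from the SHA256."""
    digest = hashlib.sha256(data).hexdigest()
    # First two characters name the folder, the rest names the file.
    return PurePosixPath(root) / digest[:2] / digest[2:]
```

Two packages with identical contents always map to the same path, so a second download of the same file cannot produce a second blob.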
Related issues
Updated by ttereshc about 3 years ago
- Project changed from RPM Support to Pulp
Could you share your pulpcore and pulp_rpm version that you are using?
Could you confirm that the first repo sync finished successfully without any failures?
Your expectation is correct that once a package is in pulp, it should not be redownloaded on a subsequent sync, same repo or not.
Updated by ajsween about 3 years ago
"versions": [
    { "component": "core", "version": "3.16.0" },
    { "component": "rpm", "version": "3.16.1" },
    { "component": "python", "version": "3.5.2" },
    { "component": "file", "version": "1.10.1" },
    { "component": "deb", "version": "2.16.0" },
    { "component": "container", "version": "2.9.0" },
    { "component": "certguard", "version": "1.5.1" },
    { "component": "ansible", "version": "0.10.1" }
],
I can confirm the syncs were successful and that version 1 of each repository was created. My expectation is that once an artifact is created in pulp, it is not downloaded again, regardless of whether other repositories/remotes/content types have content units that reference the same sha256 (artifact) hash using different metadata.
Currently I am noticing that every time a sync is done between rpm remotes/repositories and the sync optimization determines a sync is warranted, the tasking redownloads every package into tmp and then only updates those that changed. This is incredibly problematic and has made pulp completely unusable as a daily sync job against remote mirrors.
If someone could spend some time helping me get up to speed on the sync process I'd be happy to help troubleshoot the code further. I have built three new pulp deployments to test how it behaves with a brand new database with the same result. Every sync that isn't skipped results in downloading every package. Syncing a new remote/repository aimed at an identical mirror (for instance two different mirrors of Centos8 Appstream) results in downloading everything twice with no consideration of existing artifacts.
I've also written bash scripts to add all artifacts in the artifacts folder of a preexisting pulp deployment to a new pulp deployment as artifact objects. I then added a remote/repository whose referenced artifacts are all among the ones I added manually. This still results in the sync ignoring anything currently existing as an artifact in the database, downloading the entire remote repository to the tmp folder, and only then determining what content units and artifacts need to be added.
Any assistance someone can offer me would be greatly appreciated as this has made syncs of RPM repositories unviable.
Updated by dkliban@redhat.com almost 3 years ago
- Triaged changed from No to Yes
- Sprint set to Sprint 110
Updated by dalley almost 3 years ago
- Has duplicate Issue #9552: Syncs are downloading every artifact, every time added
Updated by dalley almost 3 years ago
Updated by bmbouter almost 3 years ago
I believe we do have an issue here. I used this script below:
#!/bin/bash
set -ev
repo_name="repo$RANDOM"
remote_name="remote$RANDOM"
url="https://fixtures.pulpproject.org/file/PULP_MANIFEST"
pulp file remote create --name $remote_name --url $url
pulp file repository create --name $repo_name --remote $remote_name
pulp file repository sync --name $repo_name
repo_name="repo$RANDOM"
remote_name="remote$RANDOM"
pulp file remote create --name $remote_name --url $url
pulp file repository create --name $repo_name --remote $remote_name
pulp file repository sync --name $repo_name
And added this diff to pulpcore to make it easy to monitor the downloads occurring:
diff --git a/pulpcore/download/http.py b/pulpcore/download/http.py
index 725a9c8d2..68a7a4149 100644
--- a/pulpcore/download/http.py
+++ b/pulpcore/download/http.py
@@ -156,6 +156,7 @@ class HttpDownloader(BaseDownloader):
kwargs (dict): This accepts the parameters of
:class:`~pulpcore.plugin.download.BaseDownloader`.
"""
+ log.warning(f"Downloading url: {url}")
if session:
self.session = session
self._close_session_on_finalize = False
And when I run the script I see:
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/PULP_MANIFEST
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/1.iso
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/2.iso
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/3.iso
pulp [dcf3453f2444431ba47fb3d19524c5fd]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/PULP_MANIFEST
pulp [dcf3453f2444431ba47fb3d19524c5fd]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/2.iso
pulp [dcf3453f2444431ba47fb3d19524c5fd]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/3.iso
I expected only to see:
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/PULP_MANIFEST
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/1.iso
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/2.iso
pulp [2b50c00cf8c54f2983b7b7cdd066a522]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/3.iso
pulp [dcf3453f2444431ba47fb3d19524c5fd]: pulpcore.download.http:WARNING: Downloading url: https://fixtures.pulpproject.org/file/PULP_MANIFEST
What's interesting is that in running it over and over (with a full reset in between) I never see 1.iso redownloaded, but I always see 2.iso and 3.iso.
Updated by bmbouter almost 3 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to bmbouter
Updated by pulpbot almost 3 years ago
- Status changed from ASSIGNED to POST
Updated by mdellweg almost 3 years ago
- Related to Backport #9584: Backport 9542 to pulpcore 3.15 added
Added by mdellweg almost 3 years ago
Updated by mdellweg almost 3 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulpcore|a9431560785c5b16c27708928a3dd10763401899.
Updated by jsherril@redhat.com almost 3 years ago
- Related to Backport #9596: backport Issue #9542 to pulpcore 3.15 added
Updated by pulpbot almost 3 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Fixes Artifact redownloading bug
The sync_to_async_iterable wraps the Artifact queryset, but unlike querysets, it can't be reused. This causes subsequent iterations through it to not actually iterate.
closes #9542
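The failure mode described in the commit message can be reproduced in isolation with a plain async generator. The sketch below is not the pulpcore code: `single_use_iterable` is a hypothetical stand-in for a wrapper that, like an async generator, can only be consumed once, and the file names are just labels for illustration.

```python
import asyncio


async def single_use_iterable(items):
    # Hypothetical stand-in for an async wrapper around a queryset.
    for item in items:
        yield item


async def collect(iterable):
    return [item async for item in iterable]


async def demo():
    wrapped = single_use_iterable(["1.iso", "2.iso", "3.iso"])
    first = await collect(wrapped)   # consumes the generator
    second = await collect(wrapped)  # generator is exhausted, loop body never runs
    return first, second


first_pass, second_pass = asyncio.run(demo())
print(first_pass)   # ['1.iso', '2.iso', '3.iso']
print(second_pass)  # []
```

Any lookup that relies on a second pass over the same wrapper therefore sees no existing artifacts and falls back to downloading. Creating a fresh wrapper for each pass, or materializing the results into a list once, avoids the problem in this sketch.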