Pulp publishes invalid PULP_DISTRIBUTION.xml metadata
If a repository contains a PULP_DISTRIBUTION.xml metadata file, it is possible for Pulp to re-publish it with invalid data. This causes a second Pulp server syncing from the first to fail. Specifically, the PULP_DISTRIBUTION.xml file references files that do not exist in the version published by Pulp (but do exist upstream).
For example, the RHEL6 kickstart repository contains a PULP_DISTRIBUTION.xml file that references `repodata/productid`. During sync this is downloaded along with the XML file, but when the repository is published, it is explicitly skipped.
Ultimately, this occurs because Pulp blindly syncs and publishes this PULP_DISTRIBUTION.xml file while filtering content retrieved using it.
To fix this, we should be generating/altering the PULP_DISTRIBUTION.xml file we publish to ensure we don't create invalid metadata. However, a bigger question is whether or not filtering content is even appropriate. I suspect it is not. This issue is not meant to address that problem, though.
Regenerate PULP_DISTRIBUTION.xml on publish if necessary
The PULP_DISTRIBUTION.xml file used to be saved from an upstream repository and republished without modification. This is problematic because files referenced by that file are filtered out during a publish. This commit is a short-term work-around to that problematic workflow. Without it, Pulp (or anything else using PULP_DISTRIBUTION.xml) will attempt to download files that don't exist in the published repository.
#3 Updated by email@example.com over 5 years ago
- Subject changed from Pulp-to-pulp distribution syncing is almost certainly broken in some cases to Pulp publishes invalid PULP_DISTRIBUTION.xml metadata
- Description updated (diff)
- Status changed from NEW to ASSIGNED
- Assignee set to firstname.lastname@example.org
I've re-written the issue to narrow the focus, since the original was very broad. There are already several known issues with distributions: #1768, which was only a very short-term fix and doesn't address the incorrect modeling, and #1769, which describes content we fail to mirror.
I intend to ensure Pulp doesn't publish metadata that references files that don't exist. However, the published metadata may then fail to reference files that need to exist. I don't know what is using (or not using) `repodata/productid`, and I find it troubling that we don't mirror upstream, but I don't think I should tackle all the problems we have as part of this issue.
#4 Updated by mhrivnak over 5 years ago
A simple work-around that would improve, but not fix, the situation would be to do the same filtering during sync that we do during publish. Then at least Pulp deployments with that change would happily ignore the same files that publish ignores.
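A minimal sketch of that first work-around, assuming the publish-time filter can be expressed as a single predicate shared by both code paths. The names `SKIPPED_PATHS`, `is_skipped`, `files_to_sync`, and `files_to_publish` are hypothetical and not Pulp's actual API, and the skip list contains only the one path mentioned in this issue:

```python
# Hypothetical skip list; `repodata/productid` is the only path this issue
# names as skipped at publish time. The real list in Pulp may differ.
SKIPPED_PATHS = frozenset(["repodata/productid"])


def is_skipped(relative_path):
    """Return True if the path is one that publish would skip."""
    return relative_path.strip().lstrip("/") in SKIPPED_PATHS


def files_to_sync(listed_paths):
    """Paths from PULP_DISTRIBUTION.xml that sync should download."""
    return [path for path in listed_paths if not is_skipped(path)]


def files_to_publish(stored_paths):
    """Paths that publish should write out, using the same predicate."""
    return [path for path in stored_paths if not is_skipped(path)]
```

Because both functions call the same predicate, a downstream Pulp would never request a file that an upstream Pulp declines to publish.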
As you point out, a better option is to modify the XML at publish time to filter out any files that don't actually get published. This would be more effort, but is still very doable.
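The publish-time rewrite could look roughly like the sketch below. It assumes PULP_DISTRIBUTION.xml has a root element whose direct `file` children each hold a path relative to the repository root. The function name `prune_distribution_xml` and its parameters are hypothetical, and this is not the code from the actual fix:

```python
import os
import xml.etree.ElementTree as ET


def prune_distribution_xml(xml_text, publish_dir):
    """Return XML text listing only files that exist under publish_dir.

    Assumes each <file> child of the root holds a relative path.
    """
    root = ET.fromstring(xml_text)
    # Copy the list so elements can be removed while iterating.
    for entry in list(root.findall("file")):
        relative_path = (entry.text or "").strip()
        if not os.path.isfile(os.path.join(publish_dir, relative_path)):
            root.remove(entry)
    return ET.tostring(root, encoding="unicode")
```

Checking against the published directory, rather than against a hard-coded skip list, keeps the metadata valid even if the set of filtered files changes later.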
And of course the best option would require figuring out why exactly pulp ignores those files, document that somewhere (at least in the code if not elsewhere), and determine if skipping those files is in fact appropriate.
To unblock katello, perhaps a combination of the first two would be valuable. You could probably make a PR for the first work-around very quickly, and then follow with the second option shortly thereafter. That would buy us time to further investigate why pulp is doing this at all. What do you think of that?
#10 Updated by email@example.com over 5 years ago
- Status changed from 5 to 6
[root@ibm-x3250m4-03 ~]# pulp-admin rpm repo sync run --repo-id rhel6
+----------------------------------------------------------------------+
                   Synchronizing Repository [rhel6]
+----------------------------------------------------------------------+

This command may be exited via ctrl+c without affecting the request.

Downloading metadata...
[|]
... completed

Downloading repository content...
[-]
[==================================================] 100%
RPMs:       0/0 items
Delta RPMs: 0/0 items
... completed

Downloading distribution files...
[==================================================] 100%
Distributions: 0/0 items
... completed

Importing errata...
[-]
... completed

Importing package groups/categories...
[-]
... completed

Cleaning duplicate packages...
[-]
... completed

Task Succeeded

Copying files
[-]
... completed

Initializing repo metadata
[-]
... completed

Publishing Distribution files
[|]
... completed

Publishing RPMs
[/]
... completed

Publishing Delta RPMs
... skipped

Publishing Errata
[-]
... completed

Publishing Comps file
[==================================================] 100%
212 of 212 items
... completed

Publishing Metadata.
[-]
... completed

Closing repo metadata
[-]
... completed

Generating sqlite files
... skipped

Publishing files to web
[\]
... completed

Writing Listings File
[-]
... completed

Writing Listings File
[-]
... completed

Task Succeeded