Issue #4219

Lazy syncing a repo with different metadata checksum types fails

Added by daviddavis about 1 year ago. Updated 8 months ago.

Status: NEW
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Severity: 2. Medium
Version:
Platform Release:
Blocks Release:
OS:
Backwards Incompatible: No
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
QA Contact:
Complexity:
Smash Test:
Verified: No
Verification Required: No
Sprint:

Description

Syncing a repository that has sha1 checksums in repomd.xml but sha256 checksums in primary.xml fails when using the on_demand download policy.

Steps to reproduce:
1. Find/create a repo with sha1 checksums in repomd.xml, but sha256 in primary.xml
2. Create a repo in Pulp with the on_demand download policy
3. Sync and observe the following error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 529, in __call__
    return super(Task, self).__call__(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 107, in __call__
    return super(PulpTask, self).__call__(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/pulp/server/controllers/repository.py", line 1109, in publish
    result = check_publish(repo_obj, dist_id, dist_inst, transfer_repo, conduit, call_config)
  File "/usr/lib/python2.7/site-packages/pulp/server/controllers/repository.py", line 1206, in check_publish
    result = _do_publish(repo_obj, dist_id, dist_inst, transfer_repo, conduit, call_config)
  File "/usr/lib/python2.7/site-packages/pulp/server/controllers/repository.py", line 1258, in _do_publish
    publish_report = publish_repo(transfer_repo, conduit, call_config)
  File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 737, in wrap_f
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/distributors/yum/distributor.py", line 174, in publish_repo
    return self._publisher.process_lifecycle()
  File "/usr/lib/python2.7/site-packages/pulp/plugins/util/publish_step.py", line 572, in process_lifecycle
    super(PluginStep, self).process_lifecycle()
  File "/usr/lib/python2.7/site-packages/pulp/plugins/util/publish_step.py", line 163, in process_lifecycle
    step.process()
  File "/usr/lib/python2.7/site-packages/pulp/plugins/util/publish_step.py", line 239, in process
    self._process_block(item=item)
  File "/usr/lib/python2.7/site-packages/pulp/plugins/util/publish_step.py", line 301, in _process_block
    self.process_main(item=item)
  File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/distributors/yum/publish.py", line 499, in process_main
    context.add_unit_metadata(unit)
  File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/distributors/yum/metadata/filelists.py", line 42, in add_unit_metadata
    self.metadata_file_handle.write(unit.render_filelists(self.checksum_type))
  File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/db/models.py", line 868, in render_filelists
    context = Context({'pkgid': self.get_or_calculate_and_save_checksum(checksumtype)})
  File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/db/models.py", line 258, in get_or_calculate_and_save_checksum
    checksumtype=checksumtype)
PulpCodedException: Checksum type "sha1" is not available for all units in the repository. Make sure those units have been downloaded.

It looks like we compare the unit checksumtype against the publish checksumtype being passed in here:

https://github.com/pulp/pulp_rpm/blob/2-master/plugins/pulp_rpm/plugins/db/models.py#L254

This checksum type comes from the repo scratchpad which is populated here:

https://github.com/pulp/pulp_rpm/blob/2-master/plugins/pulp_rpm/plugins/importers/yum/sync.py#L513-L533

It’s just grabbing the first item from the first metadata file to determine the repo’s checksum type. In this case, it’s sha1 while the checksum type on the package is sha256.
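To make the mismatch concrete, here is a minimal sketch (not Pulp's actual code) that reads the checksum types out of both files using only the standard library. The sample XML and the function names are made up for illustration; the checksum values are placeholders.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by yum repository metadata.
REPO_NS = "{http://linux.duke.edu/metadata/repo}"
COMMON_NS = "{http://linux.duke.edu/metadata/common}"

# Made-up sample documents; the digests are placeholders.
SAMPLE_REPOMD = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <checksum type="sha1">aaaa</checksum>
    <location href="repodata/primary.xml.gz"/>
  </data>
</repomd>"""

SAMPLE_PRIMARY = """<metadata xmlns="http://linux.duke.edu/metadata/common" packages="1">
  <package type="rpm">
    <name>example</name>
    <checksum type="sha256" pkgid="YES">bbbb</checksum>
  </package>
</metadata>"""


def repomd_checksum_types(repomd_xml):
    """Return the checksum types that repomd.xml uses for the metadata files."""
    root = ET.fromstring(repomd_xml)
    return {c.get("type") for c in root.iter(REPO_NS + "checksum")}


def package_checksum_types(primary_xml):
    """Return the checksum types that primary.xml lists for the packages."""
    root = ET.fromstring(primary_xml)
    return {
        checksum.get("type")
        for package in root.iter(COMMON_NS + "package")
        for checksum in package.findall(COMMON_NS + "checksum")
    }


metadata_types = repomd_checksum_types(SAMPLE_REPOMD)
package_types = package_checksum_types(SAMPLE_PRIMARY)
# The two sets differ for a repo like the one in this issue.
mismatch = metadata_types != package_types
```

For a repo like the one in the steps above, the first set contains only sha1 and the second only sha256, so a repo-wide type taken from repomd.xml cannot match any package.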

History

#2 Updated by daviddavis about 1 year ago

  • Description updated (diff)

#3 Updated by daviddavis about 1 year ago

  • Project changed from Pulp to RPM Support

#4 Updated by CodeHeeler about 1 year ago

  • Triaged changed from No to Yes

#5 Updated by daviddavis about 1 year ago

Note: run any solution by @jsherrill first in case it impacts Katello.

#6 Updated by daviddavis 12 months ago

Background

After doing a little bit of research, I think that there are two checksum types in a repository. First, there is the checksum type of the repository metadata files like primary.xml, updateinfo.xml, filelists.xml, etc. Checksums of this type are listed in repomd.xml.

Then there is also the checksum type of the RPM packages which is listed in the primary.xml file. I think there's a guarantee (and someone can correct me if I am wrong) that all packages in a repository have the same checksum type.

Problem

The problem is that Pulp conflates these two checksum types by storing a single checksum type for the entire repository. In the issue we're dealing with, it stores the checksum type of the repository metadata files (sha1) and tries to publish packages (which have sha256 checksums) using that type. For lazily synced repos the packages have not been downloaded yet, so Pulp only knows the sha256 checksum from primary.xml and has no file from which to compute a sha1 checksum.

Potential solutions

Given this background, I think we have two options to solve this bug:

1. Store two checksum types: one for repomd files and one for packages. This would also allow us to generate repositories that have two different checksum types and these checksum types would match the upstream repository.
2. Continue to store a single checksum type for the repo but instead of pulling the checksum type from the first metadata file, we could instead pull the checksum type from the first package entry in primary.xml and store it on the repository. This would generate a repository with a single checksum type even though the upstream repo might have a different checksum type for the repomd files. I think this would work because I believe we generate these repo metadata files and thus we can compute their checksums using the checksum type of the RPM packages.

I think option 2 would be significantly less work, but it would also be less flexible.
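As a rough sketch of what option 1 could look like, the snippet below keeps the two types separately and falls back to a single legacy value when only that is stored. All names here (RepoChecksumTypes, from_scratchpad, and the three scratchpad keys) are hypothetical and are not taken from pulp_rpm.

```python
from collections import namedtuple

# Hypothetical container: one type for the repo metadata files,
# one for the packages listed in primary.xml.
RepoChecksumTypes = namedtuple("RepoChecksumTypes", ["metadata", "package"])


def from_scratchpad(scratchpad):
    """Build the pair from a scratchpad-like dict.

    The key names are assumptions for this sketch. If only the legacy
    single value is present, it is used for both types.
    """
    legacy = scratchpad.get("checksum_type")
    return RepoChecksumTypes(
        metadata=scratchpad.get("metadata_checksum_type", legacy),
        package=scratchpad.get("package_checksum_type", legacy),
    )


old_style = from_scratchpad({"checksum_type": "sha1"})
new_style = from_scratchpad({
    "metadata_checksum_type": "sha1",
    "package_checksum_type": "sha256",
})
```

With the second form, a publish could use new_style.package for the package entries and new_style.metadata for repomd.xml, which would match the upstream repo in this issue.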

#7 Updated by jsherril@redhat.com 12 months ago

> I think there's a guarantee (and someone can correct me if I am wrong) that all packages in a repository have the same checksum type.

There's nothing enforcing this at all. A repository could contain packages with mixed checksum types.

#8 Updated by jsherril@redhat.com 12 months ago

I think we should revisit why this option even exists. I believe it is for two reasons:

1. To ensure that el5 systems can access repositories with sha1 checksums (for repomd.xml and primary.xml). This situation could happen again if a stronger checksum type is introduced that el6 or 7 does not support.
2. So that a user may choose to publish a repository with stronger checksums than what is provided upstream.

Are there other reasons?

Also, the entire design around on_demand conflicts with being able to specify checksums. I don't think we should allow this. An on_demand repo should just mirror the checksums used upstream (at least for the packages), though maybe the repomd.xml checksum type could still be set.

So, revisiting this feature, I think we should prioritize:

1. Repo publishing should never fail based on setting of any checksum type.
2. The user should be able to override the repomd.xml checksum type.
3. For immediate repos, the user should be able to override the checksum type for all packages in primary.xml.
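One possible reading of these three priorities, written as a small function. This is only a sketch: the function name and parameters are hypothetical, and it assumes a single upstream package checksum type for simplicity.

```python
def effective_checksum_types(download_policy, upstream_repomd_type,
                             upstream_package_type,
                             repomd_override=None, package_override=None):
    """Return (repomd_type, package_type) to publish with.

    Hypothetical helper. It never raises because of a checksum setting
    (priority 1); it silently ignores overrides it cannot honor.
    """
    # Priority 2: the repomd.xml type can always be overridden, because the
    # metadata files are generated locally and can be hashed with any type.
    repomd_type = repomd_override or upstream_repomd_type

    # Priority 3: package checksums can only be recomputed when the package
    # files are on disk, which is assumed here only for "immediate" repos.
    if download_policy == "immediate" and package_override:
        package_type = package_override
    else:
        package_type = upstream_package_type

    return repomd_type, package_type


lazy = effective_checksum_types("on_demand", "sha1", "sha256",
                                package_override="sha1")
eager = effective_checksum_types("immediate", "sha1", "sha256",
                                 repomd_override="sha256",
                                 package_override="sha512")
```

In the on_demand call the package override is dropped and the upstream sha256 is kept, while the immediate call honors both overrides.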

#10 Updated by ttereshc 12 months ago

I agree with Justin's list of prioritised items.
However, I think it's not clear how publish should happen.
We can't assume that the user just mirrors a repo from a remote.
- Imagine 2+ on_demand repos, each with a different checksum type.
- The user wants to create a new repo with RPMs from those different repos.
- The new repo will contain RPMs with different checksum types.

The only way we can support this case and satisfy Justin's #1 is to publish with the only available checksum type for each individual RPM.
That means that in primary.xml, there will be a mixture of checksum types. If we can support it and no tooling breaks because of it, it will be the best case.
IIRC, there were such examples on the CDN but because Pulp was failing to publish those, a remote repo was updated to have all checksum types the same. It looked like dnf is fine with it. So we may want to double-check it if we consider supporting this case.
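A minimal sketch of the per-RPM selection described above, assuming each unit carries a dict of the checksums already known for it. The function names and the unit representation are hypothetical, and the fallback rule (first type in sorted order) is just one deterministic choice.

```python
def choose_unit_checksum(available, preferred):
    """Pick a (type, digest) pair for one unit.

    available: dict mapping checksum type to the digest known for the unit.
    preferred: the checksum type requested for the publish.
    """
    if not available:
        raise ValueError("unit has no known checksums")
    if preferred in available:
        return preferred, available[preferred]
    # The preferred type is unknown (e.g. the file was never downloaded),
    # so fall back to a type that is available instead of failing.
    fallback = sorted(available)[0]
    return fallback, available[fallback]


def plan_publish_checksums(units, preferred):
    """Return (name, checksum_type, digest) for every unit."""
    plan = []
    for name, available in units:
        checksum_type, digest = choose_unit_checksum(available, preferred)
        plan.append((name, checksum_type, digest))
    return plan


# Made-up units: one lazily synced with only sha256 known, one with both.
units = [
    ("lazy-pkg", {"sha256": "aa"}),
    ("local-pkg", {"sha1": "bb", "sha256": "cc"}),
]
plan = plan_publish_checksums(units, preferred="sha1")
```

Here the plan mixes sha256 and sha1 entries, which is exactly the mixed primary.xml case that would need to be checked against dnf and other tooling.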

#11 Updated by bmbouter 8 months ago

  • Tags Pulp 2 added
