Issue #3949

Repositories can end up with RPMs with duplicate nevra after copy

Added by dustball about 3 years ago. Updated over 2 years ago.

Status: CLOSED - WONTFIX
Priority: Normal
Assignee: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

When synchronizing from http://ftp.halifax.rwth-aachen.de/opensuse/repositories/devel:/languages:/python/SLE_12_SP3/ (which in turn mirrors from the open build system), we have discovered a duplicate metadata definition in the primary.xml for the same package when a repository is copied:

<package type="rpm">
  <name>python-pyparsing</name>
  <arch>noarch</arch>
  <version epoch="0" rel="1.4" ver="2.2.0" />
  <checksum pkgid="YES" type="sha256">f0cb17cd16db6711cfc6a746fbe5bdf09fab61a516326b72b48270219174b332</checksum>
  <summary>Grammar Parser Library for Python</summary>
  <description>The pyparsing module is an alternative approach to creating and executing
simple grammars, vs. the traditional lex/yacc approach, or the use of regular
expressions. The pyparsing module provides a library of classes that client
code uses to construct the grammar directly in Python code.</description>
  <packager />
  <url>http://pyparsing.wikispaces.com/</url>
  <time build="1523554579" file="1523554593" />
  <size archive="790600" installed="788413" package="150252" />
  <location href="Packages/p/python-pyparsing-2.2.0-1.4.noarch.rpm"/>
  <format>
    <rpm:license>MIT and GPL-2.0+ and GPL-3.0+</rpm:license>
    <rpm:vendor>obs://build.opensuse.org/devel:languages:python</rpm:vendor>
    <rpm:group>Development/Languages/Python</rpm:group>
    <rpm:buildhost>lamb10</rpm:buildhost>
    <rpm:sourcerpm>python-pyparsing-2.2.0-1.4.src.rpm</rpm:sourcerpm>
    <rpm:header-range end="33464" start="440" />
    <rpm:provides>
      <rpm:entry epoch="0" flags="EQ" name="python-parsing" ver="2.2.0" />
      <rpm:entry epoch="0" flags="EQ" name="python-pyparsing" rel="1.4" ver="2.2.0" />
      <rpm:entry epoch="0" flags="EQ" name="python2-pyparsing" rel="1.4" ver="2.2.0" />
    </rpm:provides>
    <rpm:requires>
      <rpm:entry epoch="0" flags="EQ" name="python(abi)" ver="2.7" />
      <rpm:entry name="python-base" />
    </rpm:requires>
    <rpm:obsoletes>
      <rpm:entry epoch="0" flags="LT" name="python-parsing" ver="2.2.0" />
    </rpm:obsoletes>
  </format>
</package>

and

<package type="rpm">
  <name>python-pyparsing</name>
  <arch>noarch</arch>
  <version epoch="0" rel="1.4" ver="2.2.0" />
  <checksum pkgid="YES" type="sha256">443ce7c1a7a6c1dd9d2559bf045be8e11c28f951f4ab2336fbc83a53ee667a4f</checksum>
  <summary>Grammar Parser Library for Python</summary>
  <description>The pyparsing module is an alternative approach to creating and executing
simple grammars, vs. the traditional lex/yacc approach, or the use of regular
expressions. The pyparsing module provides a library of classes that client
code uses to construct the grammar directly in Python code.</description>
  <packager />
  <url>http://pyparsing.wikispaces.com/</url>
  <time build="1523550890" file="1523550908" />
  <size archive="790600" installed="788413" package="150209" />
  <location href="Packages/p/python-pyparsing-2.2.0-1.4.noarch.rpm"/>
  <format>
    <rpm:license>MIT and GPL-2.0+ and GPL-3.0+</rpm:license>
    <rpm:vendor>obs://build.opensuse.org/devel:languages:python</rpm:vendor>
    <rpm:group>Development/Languages/Python</rpm:group>
    <rpm:buildhost>obs-power8-05</rpm:buildhost>
    <rpm:sourcerpm>python-pyparsing-2.2.0-1.4.src.rpm</rpm:sourcerpm>
    <rpm:header-range end="33480" start="440" />
    <rpm:provides>
      <rpm:entry epoch="0" flags="EQ" name="python-parsing" ver="2.2.0" />
      <rpm:entry epoch="0" flags="EQ" name="python-pyparsing" rel="1.4" ver="2.2.0" />
      <rpm:entry epoch="0" flags="EQ" name="python2-pyparsing" rel="1.4" ver="2.2.0" />
    </rpm:provides>
    <rpm:requires>
      <rpm:entry epoch="0" flags="EQ" name="python(abi)" ver="2.7" />
      <rpm:entry name="python-base" />
    </rpm:requires>
    <rpm:obsoletes>
      <rpm:entry epoch="0" flags="LT" name="python-parsing" ver="2.2.0" />
    </rpm:obsoletes>
  </format>
</package>

The "original" repository (which has a feed and synchronizes correctly) does not have this issue. I suspect that old metadata is not purged when the metadata is copied. Note: Removing the broken primary.xml and publishing the copied repository again results in the same issue.
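
For anyone wanting to check a published repository for this condition, the following is a minimal sketch that scans a decompressed primary.xml for NEVRAs that appear more than once. It assumes the usual `http://linux.duke.edu/metadata/common` default namespace on the `<metadata>` root; the function name and the sample document (with placeholder checksums) are made up for illustration and are not part of Pulp.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Default namespace declared on the <metadata> root of a primary.xml file.
COMMON_NS = {"c": "http://linux.duke.edu/metadata/common"}

# Illustrative input only: two entries with the same NEVRA and placeholder checksums.
SAMPLE = """<metadata xmlns="http://linux.duke.edu/metadata/common" packages="2">
<package type="rpm">
  <name>bear</name><arch>noarch</arch>
  <version epoch="0" rel="1" ver="4.1" />
  <checksum pkgid="YES" type="sha256">aaaa</checksum>
</package>
<package type="rpm">
  <name>bear</name><arch>noarch</arch>
  <version epoch="0" rel="1" ver="4.1" />
  <checksum pkgid="YES" type="sha256">bbbb</checksum>
</package>
</metadata>"""


def find_duplicate_nevra(primary_xml):
    """Return {nevra: [checksums]} for every NEVRA listed more than once."""
    root = ET.fromstring(primary_xml)
    seen = defaultdict(list)
    for pkg in root.findall("c:package", COMMON_NS):
        version = pkg.find("c:version", COMMON_NS)
        nevra = (
            pkg.findtext("c:name", namespaces=COMMON_NS),
            version.get("epoch"),
            version.get("ver"),
            version.get("rel"),
            pkg.findtext("c:arch", namespaces=COMMON_NS),
        )
        seen[nevra].append(pkg.findtext("c:checksum", namespaces=COMMON_NS))
    # Keep only the NEVRAs that have more than one metadata entry.
    return {nevra: sums for nevra, sums in seen.items() if len(sums) > 1}
```

Calling `find_duplicate_nevra(SAMPLE)` reports the `bear` NEVRA together with both checksums; a repository without the problem yields an empty dict.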

History

#1 Updated by dustball about 3 years ago

I may have an idea where this stems from. We have four repository groups: daily (which has feeds assigned and syncs daily), development, testing, and production. The packages get copied from daily all the way through to production, so of course production has the oldest packages.

As it happened, the package got rebuilt for unknown reasons. Because some build metadata is embedded in the package, the checksum changed as well, but since it was only a rebuild and the actual functionality didn't change, the version of the package wasn't bumped. Now I end up with two packages: the one with "1523550890" as its build time in production, and the one with "1523554579" in current. Same package name, same installed size, different checksum. How does pulp deal with that?

Edit: For the record, I don't think this is an issue with pulp per se, as I've never before seen package rebuilds which end with different checksums but no bump of the release version.

#2 Updated by daviddavis about 3 years ago

The unit key (which determines uniqueness) for an rpm package includes its checksum[0], so two packages with the same nevra but different checksums are considered to be two distinct packages. The issue is that only one can be published in a repo, since the package filename is <name>-<version>-<relver>.<arch>.rpm. Thus, the one that gets published to the filesystem is the one that should show up in the metadata.

A couple possible solutions:

1. When a package is synced/copied/uploaded to a repo, check for existing packages by nevra and remove any existing ones
2. Allow a repo to have two packages with the same nevra but when publishing, use the newest. Also, make sure the metadata and filesystem both point to the newest (and only the newest) package.

Also, @dkliban mentioned a tweak to option 2 where instead of just using the newest, an exception gets raised on publish for a repo that has 2 packages with the same nevra.

Edit: It looks like there is already code in the rpm plugin to handle deduplicating rpms by nevra[1]. So I think it may not be working?

[0] https://github.com/pulp/pulp_rpm/blob/6d6c8292d2abba37142ff4f2af73c3fe6c3cb90d/plugins/pulp_rpm/plugins/db/models.py#L779
[1] https://github.com/pulp/pulp_rpm/blob/2-master/plugins/pulp_rpm/plugins/importers/yum/purge.py#L284-L304
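
To make option 2 and the "raise on publish" tweak concrete, here is a minimal sketch of both behaviours. It is not the pulp_rpm implementation: `Rpm`, `DuplicateNevraError`, and `select_for_publish` are hypothetical names, "newest" is assumed to mean the highest build time, and the checksums are shortened from the ones in the description.

```python
from collections import defaultdict
from typing import NamedTuple


class Rpm(NamedTuple):
    """Hypothetical stand-in for an RPM unit; not a Pulp model."""
    name: str
    epoch: str
    version: str
    release: str
    arch: str
    checksum: str
    build_time: int

    @property
    def nevra(self):
        return (self.name, self.epoch, self.version, self.release, self.arch)


class DuplicateNevraError(Exception):
    """Raised in strict mode when a repo holds several units per NEVRA."""


def select_for_publish(units, strict=False):
    """Pick one unit per NEVRA (the newest build), or raise if strict."""
    by_nevra = defaultdict(list)
    for unit in units:
        by_nevra[unit.nevra].append(unit)
    if strict:
        duplicates = sorted(n for n, group in by_nevra.items() if len(group) > 1)
        if duplicates:
            raise DuplicateNevraError(duplicates)
    # Assumption: "newest" means the largest build timestamp.
    return [max(group, key=lambda u: u.build_time) for group in by_nevra.values()]


old = Rpm("python-pyparsing", "0", "2.2.0", "1.4", "noarch", "443ce7c1", 1523550890)
new = Rpm("python-pyparsing", "0", "2.2.0", "1.4", "noarch", "f0cb17cd", 1523554579)
published = select_for_publish([old, new])  # keeps only the newer build
```

With `strict=True` the same input raises `DuplicateNevraError` instead of silently dropping the older build, which corresponds to the variant @dkliban suggested.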

#3 Updated by daviddavis about 3 years ago

  • Project changed from Pulp to RPM Support

#4 Updated by dustball about 3 years ago

Neither solution is satisfying, because it would provide the newer package to the older repo, which is unwanted.

To give a bit of insight, the staging process goes "current" (synced, the only repos that have feeds), dev, test, prod, in that order. Each repo is copied from the previous one; there are never skips.

Now, having looked on disk for this specific package: current links to the current one, prod links to a different path (which I assume gives the old package, since the checksums don't match up), and dev links to the new package, but its primary.xml lists the metadata for both, and the system verification has the checksum for the old version.

#5 Updated by dustball about 3 years ago

I just discovered that, after publishing again, prod has both checksums as well, although to my knowledge the packages weren't copied to prod yet (Last Updated: 2018-07-03T20:33:51Z).

#6 Updated by dustball about 3 years ago

I may have an idea on how to solve this, but I'm not sure how viable this actually is. Now, I assume the metadata is linked to the actual on-disk file, and during publishing pulp encounters the metadata, links the file, encounters the new metadata with the same nevra, and links that as well (thus removing the old file).

The solution would then be to prepend the first two characters of the checksum to the package file, making it unique. This should be possible by rewriting the "location href" tag.

Edit: I just realised this opens another issue. Since the nevra is still the same, how do the package managers determine which package to install? At that point, the build time would have to be set to epoch, if epoch is zero.
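
The href rewrite proposed in this comment could look roughly like the sketch below. The helper name `unique_href` and the `-` separator are assumptions for illustration; nothing like this exists in Pulp as far as this thread shows. Note that a two-character prefix only makes a collision unlikely, not impossible.

```python
import posixpath


def unique_href(href, checksum, prefix_len=2):
    """Prefix the filename in a <location href> with the start of the checksum."""
    directory, filename = posixpath.split(href)
    # Assumption: a "-" separates the checksum prefix from the original filename.
    return posixpath.join(directory, f"{checksum[:prefix_len]}-{filename}")


href = "Packages/p/python-pyparsing-2.2.0-1.4.noarch.rpm"
new_build = unique_href(href, "f0cb17cd16db6711")  # checksum shortened for readability
old_build = unique_href(href, "443ce7c1a7a6c1dd")
```

The two builds from the description then map to different paths under `Packages/p/`, so neither file overwrites the other on publish. This does not answer the follow-up question of which build a package manager would choose.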

#7 Updated by daviddavis about 3 years ago

I was able to reproduce this. Here are the steps:

1. Sync down a repo (e.g. zoo)
2. Copy the contents from zoo into a new repo zoo2
3. Tweak one of the packages in the remote repo but keep the same nevra for the package. I changed the contents of bear-4.1-1.noarch.rpm
4. Sync zoo and copy its contents again into zoo2.
5. Publish zoo2

While zoo will have only one copy of bear-4.1-1.noarch.rpm, zoo2 will have two copies. Here's a snippet of the primary.xml.gz for zoo2:

<package type="rpm">
  <name>bear</name>
  <arch>noarch</arch>
  <version epoch="0" rel="1" ver="4.1" />
  <checksum pkgid="YES" type="sha256">7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3</checksum>
  <summary>A dummy package of bear</summary>
  <description>A dummy package of bear</description>
  <packager />
  <url>http://tstrachota.fedorapeople.org</url>
  <time build="1331831374" file="1445346933" />
  <size archive="296" installed="42" package="2438" />
  <location href="Packages/b/bear-4.1-1.noarch.rpm"/>
  <format>
    <rpm:license>GPLv2</rpm:license>
    <rpm:vendor />
    <rpm:group>Internet/Applications</rpm:group>
    <rpm:buildhost>smqe-ws15</rpm:buildhost>
    <rpm:sourcerpm>bear-4.1-1.src.rpm</rpm:sourcerpm>
    <rpm:header-range end="2289" start="872" />
    <rpm:provides>
      <rpm:entry epoch="0" flags="EQ" name="bear" rel="1" ver="4.1" />
    </rpm:provides>
  </format>
</package>
<package type="rpm">
  <name>bear</name>
  <arch>noarch</arch>
  <version epoch="0" rel="1" ver="4.1" />
  <checksum pkgid="YES" type="sha256">57c1ad73fab49404f407186b418a5fef34d4e4bc40a6fbef2a5f92c0a2137ee8</checksum>
  <summary>bear package!</summary>
  <description />
  <packager />
  <url>http://google.com</url>
  <time build="1536871775" file="1536871889" />
  <size archive="124" installed="0" package="5820" />
  <location href="Packages/b/bear-4.1-1.noarch.rpm"/>
  <format>
    <rpm:license>GPLv2</rpm:license>
    <rpm:vendor />
    <rpm:group>Development/Tools</rpm:group>
    <rpm:buildhost>pulp2.dev</rpm:buildhost>
    <rpm:sourcerpm>bear-4.1-1.src.rpm</rpm:sourcerpm>
    <rpm:header-range end="5704" start="4504" />
    <rpm:provides>
      <rpm:entry epoch="0" flags="EQ" name="bear" rel="1" ver="4.1" />
    </rpm:provides>
  </format>
</package>

It looks like the code we're using to remove duplicates after sync is not being run after copy. One possible solution is to throw an error on copy if there's already an rpm with a matching nevra in the destination repo. Adding this check could slow things down, though.
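
As a rough illustration of the copy-time check, the sketch below indexes the destination by NEVRA once, so each incoming unit costs a dictionary lookup and no per-unit scan of the repo. `NevraConflict` and `check_copy` are hypothetical names, units are modelled as plain `(nevra, checksum)` pairs, and the real cost inside Pulp would depend on its database queries, which this does not model.

```python
class NevraConflict(Exception):
    """Raised when an incoming RPM matches an existing NEVRA with another checksum."""


def check_copy(incoming, destination):
    """Validate a copy; both arguments are iterables of (nevra, checksum) pairs."""
    existing = {}
    for nevra, checksum in destination:
        existing.setdefault(nevra, set()).add(checksum)
    conflicts = [
        (nevra, checksum)
        for nevra, checksum in incoming
        # Re-copying the very same unit (same checksum) is not a conflict.
        if nevra in existing and checksum not in existing[nevra]
    ]
    if conflicts:
        raise NevraConflict(conflicts)


bear = ("bear", "0", "4.1", "1", "noarch")
zoo2 = [(bear, "7a831f9f")]          # checksums shortened from the snippet above
check_copy([(bear, "7a831f9f")], zoo2)  # same unit again: passes silently
```

Copying the rebuilt `bear` (checksum starting `57c1ad73`) into `zoo2` would raise `NevraConflict` under this sketch.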

#8 Updated by daviddavis about 3 years ago

  • Subject changed from The same package can have two (or more?) assigned metadata definitions to Repositories can end up with RPMs with duplicate nevra after copy

#9 Updated by CodeHeeler about 3 years ago

  • Triaged changed from No to Yes

#10 Updated by gmbnomis about 3 years ago

"copy" is the "/pulp/api/v2/repositories/<destination_repo_id>/actions/associate/" endpoint, right?

Actually, the current behavior is what we expect from Pulp (given the RPM unit key). We actively manage our staging repositories, i.e. we:

1. Compute what should be in a staging repo and associate it into it
2. Compute what is superfluous in the staging repo and un-associate it

Removing packages with duplicate NEVRA is a side effect of step 2 in our use case (usually it is done to remove outdated stuff and RPMs with duplicate NEVRA are always considered to be outdated).
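
The two steps can be pictured as plain set differences over unit keys, as in this minimal sketch. `plan_staging_update` is a hypothetical helper and the unit keys are simplified to `(nevra, checksum)` pairs; it only illustrates why the duplicate disappears in step 2, not how the tooling described here is actually written.

```python
def plan_staging_update(desired, current):
    """Return (to_associate, to_unassociate) for sets of (nevra, checksum) keys."""
    to_associate = desired - current      # step 1: what is missing in the staging repo
    to_unassociate = current - desired    # step 2: what is superfluous there
    return to_associate, to_unassociate


pyparsing = ("python-pyparsing", "0", "2.2.0", "1.4", "noarch")
current = {(pyparsing, "443ce7c1")}   # old build, checksum shortened
desired = {(pyparsing, "f0cb17cd")}   # rebuilt package with the same NEVRA
to_add, to_remove = plan_staging_update(desired, current)
```

Because the checksum is part of the key, the old build ends up in `to_remove` without any NEVRA-specific logic.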

Thus, getting an error on "associate" in case of duplicate NEVRA would actually hurt our use case: We would have to split up duplicate NEVRA and "outdated stuff" handling:

1. Compute duplicate NEVRA RPMs and unassociate
2. Compute what should be in a staging repo and associate it into it
3. Compute what is superfluous in the staging repo and un-associate it

However, this would leave the repository in a state with missing RPMs if we run into an error when associating. I don't like that very much, to be honest.

Re removing duplicates after association: Having content units disappear during an "associate" operation feels wrong, but I see that this can be very useful in other cases. I would prefer to have a flag (in the repo or the associate call) to control this behavior.

Btw, I like @dkliban's idea of throwing an exception at publication if there are content units that can't be represented in the publication (I think there is also another case today IIRC: two RPMs with the same filename will overwrite each other). But we may be too late in the life cycle of Pulp 2 to change that(?)

#11 Updated by bmbouter over 2 years ago

  • Status changed from NEW to CLOSED - WONTFIX

Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.

#12 Updated by bmbouter over 2 years ago

  • Tags Pulp 2 added
