Issue #1287

Repo sync failing with KeyError

Added by XenoPhage over 5 years ago. Updated about 2 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
1. Low
Version:
2.6.2 Beta
Platform Release:
2.11.1
OS:
RHEL 7
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 13
Quarter:

Description

Greetings,

I'm using Pulp with Foreman and Katello to sync a number of repos. One of those repos is the GitLab repo located here:

https://packages.gitlab.com/gitlab/gitlab-ce/el/7/x86_64

The repo continually fails to sync, giving this error in the logs:

Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) <bound method ContainerListener.download_succeeded of <pulp_rpm.plugins.importers.yum.repomd.alternate.ContainerListener object at 0x425ddd0>>
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) Traceback (most recent call last):
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) File "/usr/lib/python2.7/site-packages/pulp/server/content/sources/container.py", line 148, in _forward
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) method(request)
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/importers/yum/repomd/alternate.py", line 126, in download_succeeded
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) self.content_listener.download_succeeded(report)
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/importers/yum/listener.py", line 79, in download_succeeded
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) self.metadata_files.add_repodata(model)
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/importers/yum/repomd/metadata.py", line 327, in add_repodata
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) raw_xml = db_file[db_key]
Sep 30 12:22:14 katello pulp: pulp.server.content.sources.container:ERROR: (11499-77664) KeyError: 'arch:x86_64::epoch:0::name:gitlab-ce::release:1::version:7.10.1~omnibus.2'

It was suggested in the #foreman IRC channel that tildes may be the cause of this problem. I'm marking this as a pulp 2.6.2 install, but I'm not entirely certain. rpm suggests 2.6.2 based on package names.


Checklist


Related issues

Has duplicate Pulp - Issue #2003: Pulp will not sync RPM files hosted using "packagecloud". (CLOSED - DUPLICATE)
Has duplicate RPM Support - Issue #1932: More gracefully handle KeyError when an rpm has no data in filelists.xml nor in other.xml (CLOSED - DUPLICATE)

Associated revisions

Revision dd640422 View on GitHub
Added by ttereshc over 4 years ago

Fail sync and report early in case of inconsistency in metadata

Sync task will fail when:

  • any required metadata files, like filelists.xml, are missing
  • filelists.xml or other.xml does not contain metadata for all the packages in the repository

closes #1287 https://pulp.plan.io/issues/1287
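
The kind of early check this commit message describes can be sketched as follows. This is not the code from revision dd640422; the function names and the `MetadataInconsistency` exception are hypothetical, and the sample XML is invented for illustration. It assumes the usual repodata layout, where primary.xml carries the package id in `<checksum pkgid="YES">` and filelists.xml carries it in the `pkgid` attribute.

```python
import xml.etree.ElementTree as ET

COMMON_NS = "{http://linux.duke.edu/metadata/common}"
FILELISTS_NS = "{http://linux.duke.edu/metadata/filelists}"


class MetadataInconsistency(Exception):
    """Hypothetical error type: raised before any package is processed."""


def primary_pkgids(primary_xml):
    # In primary.xml the package id is the text of <checksum pkgid="YES">.
    root = ET.fromstring(primary_xml)
    ids = set()
    for checksum in root.iter(COMMON_NS + "checksum"):
        if checksum.get("pkgid") == "YES" and checksum.text:
            ids.add(checksum.text.strip())
    return ids


def filelists_pkgids(filelists_xml):
    # In filelists.xml the id is the pkgid attribute of each <package>.
    root = ET.fromstring(filelists_xml)
    return {pkg.get("pkgid") for pkg in root.iter(FILELISTS_NS + "package")}


def check_filelists(primary_xml, filelists_xml):
    """Raise if filelists.xml is absent or does not cover every package."""
    if filelists_xml is None:
        raise MetadataInconsistency("required metadata file filelists.xml is missing")
    missing = primary_pkgids(primary_xml) - filelists_pkgids(filelists_xml)
    if missing:
        raise MetadataInconsistency(
            "filelists.xml has no entry for %d package(s)" % len(missing))


# Invented sample data: one package in primary, none in filelists.
PRIMARY = (
    '<metadata xmlns="http://linux.duke.edu/metadata/common" packages="1">'
    '<package type="rpm"><name>gitlab-ce</name>'
    '<checksum type="sha256" pkgid="YES">abc123</checksum></package>'
    '</metadata>'
)
EMPTY_FILELISTS = (
    '<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="0"/>'
)

try:
    check_filelists(PRIMARY, EMPTY_FILELISTS)
except MetadataInconsistency as exc:
    print("sync would fail early:", exc)
```

Running the comparison once, up front, gives a single clear failure instead of a KeyError per package in the middle of the download phase.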

History

#1 Updated by mhrivnak over 5 years ago

  • Priority changed from Normal to High
  • Severity changed from 2. Medium to 3. High
  • Triaged changed from No to Yes

#2 Updated by ipanova@redhat.com over 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to ipanova@redhat.com

#3 Updated by ipanova@redhat.com over 5 years ago

It is not the tilde issue. The problem is that filelists.xml.gz is empty: https://packages.gitlab.com/gitlab/gitlab-ce/el/7/x86_64/repodata/badbe13e3379bc67c05d60367bc9ef462bea9ccf-filelists.xml.gz

<?xml version="1.0" encoding="UTF-8"?>
<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="0"/>

filelists.xml.gz usually contains complete information about all the packages in the repository. Here it is empty, which is why sync is not able to fetch the details of the packages in the repository.
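
A quick way to confirm this for a downloaded filelists.xml.gz is to read the `packages` attribute on the root element. The snippet below is a minimal sketch; `declared_package_count` is a hypothetical helper, and the input is the document quoted above, compressed in memory rather than fetched from the repository.

```python
import gzip
import io
import xml.etree.ElementTree as ET

FILELISTS_NS = "http://linux.duke.edu/metadata/filelists"


def declared_package_count(gz_bytes):
    """Return the packages="N" attribute of a gzipped filelists.xml."""
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as handle:
        root = ET.parse(handle).getroot()
    if root.tag != "{%s}filelists" % FILELISTS_NS:
        raise ValueError("not a filelists document: %r" % root.tag)
    return int(root.get("packages", "0"))


# The document quoted above, compressed in memory for the demonstration.
empty = gzip.compress(
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="0"/>'
)
print(declared_package_count(empty))
```

A count of 0 alongside a non-empty primary.xml is the mismatch that leads to the KeyError in the traceback.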

#4 Updated by XenoPhage over 5 years ago

Ok, so that makes this an issue for the repo manager, correct? Or
should pulp be handling this differently?

#5 Updated by bmbouter over 5 years ago

Is the following line in the primary.xml.gz or the filelists.xml.gz?

<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="0"/>

Is this expected given the contents of the repo or not?

#6 Updated by ipanova@redhat.com over 5 years ago

Brian, the following line is from filelists.xml.gz, and it is not supposed to look like this. The filelists.xml.gz is broken, so at least search on files will not work. This kind of filelists is produced when you try to generate repodata in an empty directory.
Nevertheless, reposync (from yum-utils) can sync content even from a repository with this kind of issue. So the question is: do we want to implement a workaround for this situation or not?

#7 Updated by mhrivnak over 5 years ago

I started a discussion last week via email about whether it is ok for the filelists collection to be empty: http://lists.baseurl.org/pipermail/yum-devel/2015-October/010778.html

#9 Updated by ipanova@redhat.com over 5 years ago

So I got some answers from yum devs.

1) From createrepo's point of view it is NOT valid for filelists to be empty. Repodata is usually created with 'createrepo' or 'createrepo_c', and neither will create an empty filelists unless it is run on an empty directory. Digging into this problematic repo, there is a very strange comment line in the primary.xml file: <!--generated by the amazing packagecloud.io RPM indexer-->. They are probably using their own tool, perhaps based on createrepo, which most likely has issues.

Another odd point: file lists are usually put either only into filelists, or into both primary and filelists. In this case I can see them only in the primary file, which is strange.

2) With empty filelists, at least search on files will not work.

All these points were discussed and confirmed by the yum devs.

#10 Updated by ipanova@redhat.com over 5 years ago

Another point is that primary.xml can contain file lists for a package, but only for certain files. Taken from the createrepo docs [0]: "file lists for the package for CERTAIN files - specifically files matching: /etc*, bin/, /usr/lib/sendmail". A normal createrepo run would put these files in both places, primary and filelists. In our case they are present only in primary; the authors probably wanted to save some space this way.
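
To make the rule concrete, here is a rough sketch of how such a filter could look. The regular expression is only my reading of the three patterns quoted above (an `/etc` prefix, `bin/` anywhere in the path, and the exact path `/usr/lib/sendmail`); the expression createrepo actually uses may differ, and the sample paths are invented.

```python
import re

# Assumption: an approximation of the patterns quoted from the createrepo
# docs, not the expression used by createrepo itself.
PRIMARY_FILE_RE = re.compile(r"^/etc|bin/|^/usr/lib/sendmail$")


def is_primary_file(path):
    """True if the path would also be listed in primary.xml under this rule."""
    return PRIMARY_FILE_RE.search(path) is not None


# Invented example paths.
paths = [
    "/etc/gitlab/gitlab.rb",
    "/opt/gitlab/embedded/bin/ruby",
    "/usr/lib/sendmail",
    "/opt/gitlab/embedded/lib/libz.so",
]
for path in paths:
    print(path, is_primary_file(path))
```

Under this rule only a small subset of a package's files lands in primary.xml; everything else is expected to be found in filelists.xml.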

[0] http://createrepo.baseurl.org/#rpmmetadata

#11 Updated by ipanova@redhat.com over 5 years ago

  • Severity changed from 3. High to 1. Low

So the outcome of research and debugging is - gitlab repositories are malformed, but we still need to fail gracefully when there is data missing in filelists.

#12 Updated by ipanova@redhat.com over 5 years ago

  • Status changed from ASSIGNED to NEW

#13 Updated by XenoPhage over 5 years ago

FWIW, I spoke with the folks at PackageCloud. This is actually a feature they offer. The argument is that creating the filelists.xml.gz file can take a long time for large repositories and they allow users to disable them for performance reasons. I have passed a link to this ticket on to them for review.

#14 Updated by joedamato over 5 years ago

I wrote the packagecloud YUM indexer and would like to modify it to allow Pulp to work properly.

In the primary.xml, we do provide the list of files that match the documented regexps as explained in the createrepo source code. So, any code in pulp which expects these files to exist will work. This cannot be disabled by the repo owner.

We do, however, offer users the option to disable filelists.xml generation. We offer this option for three main reasons:

1.) In repositories with large packages (packages created with Chef Omnibus: https://github.com/chef/omnibus), the number of files can grow to be enormous. The typical omnibus package can have about 50k files in a single package. If a repo has, say, 10 versions of a package and each version has ~50k files, the filelists.xml.gz can get large quite quickly. This becomes problematic for users who are sensitive to disk space and network bandwidth (embedded systems, systems running entirely in RAM, etc). I've personally seen filelists.xml.gz grow to be larger than 800 MB. Decompressed, I've seen filelists.xml reach over 2.5 GB.

2.) In the cases mentioned in point 1, filelists don't provide any value to the end user. The purpose of filelists is to allow an end user to query a file regexp and figure out which package provided that file. Omnibus packages have tens of thousands of files so any system library, binary, data file, etc that you'd write a regexp for would likely cause matches against the files installed via omnibus. This is specific to this style of packaging as many system tools are rebuilt and embedded directly in the package itself.

3.) And a (much) less important reason: generating a huge filelist takes some time. We don't really care about reason 3 so much, and we allow our users to enable filelist generation if they choose.

I would humbly propose that pulp be modified to deal with the case where filelists.xml and other.xml are not provided, or are empty.

I would be happy to help in any way possible including modifying our indexer to produce some XML that would be easier for pulp to process when the user has explicitly decided to not provide these files.

For example, we could modify the filelists.xml to provide something like this:

<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="1">
<package pkgid='x' name='packagename' arch='x86_64'>
<version epoch="1" ver="1.1" rel="3"/>
</package>
</filelists>

In this way, the package count would match and the rest of the package metadata would be available, but there would simply be no "<file>" entries available for each package.
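
As an illustration of the proposal, an indexer could emit such a stub with a few lines of code. This is only a sketch: `stub_filelists` and the dictionary layout of its input are hypothetical, and the values are the placeholders from the example above. It says nothing about how the packagecloud indexer is actually implemented.

```python
import xml.etree.ElementTree as ET

FILELISTS_NS = "http://linux.duke.edu/metadata/filelists"


def stub_filelists(packages):
    """Build a filelists.xml with one <package> per rpm and no <file> entries."""
    # Serialize with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", FILELISTS_NS)
    root = ET.Element("{%s}filelists" % FILELISTS_NS,
                      {"packages": str(len(packages))})
    for pkg in packages:
        node = ET.SubElement(root, "{%s}package" % FILELISTS_NS, {
            "pkgid": pkg["pkgid"], "name": pkg["name"], "arch": pkg["arch"]})
        ET.SubElement(node, "{%s}version" % FILELISTS_NS, {
            "epoch": pkg["epoch"], "ver": pkg["ver"], "rel": pkg["rel"]})
    return ET.tostring(root, encoding="unicode")


# Placeholder values taken from the example above.
xml_text = stub_filelists([
    {"pkgid": "x", "name": "packagename", "arch": "x86_64",
     "epoch": "1", "ver": "1.1", "rel": "3"},
])
print(xml_text)
```

Because the `packages` attribute is derived from the list length, the declared count and the number of `<package>` elements cannot drift apart.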

#15 Updated by mhrivnak over 5 years ago

Thank you for taking the time to get in touch, and for obviously putting a lot of thought into this. I likewise would like pulp to be able to reliably sync repos from packagecloud.

I think the core of the problem we're facing is that RPM as a packaging format is getting used in ways it was never intended for. Pulp has faced other problems from large file lists on packages (specifically ones from chef omnibus), such as exceeding the max document size of our database.

This is compounded by the fact that the repo metadata schema is not documented, so the best authority we have is looking at how createrepo, yum and dnf behave (which themselves change over time).

Your proposal sounds reasonable. Here is my general concern: when pulp does a sync, it stores the upstream XML for each rpm, and simply re-assembles those snippets later. This saves tremendous time, especially when publishing a repo, but it means pulp has to trust that the upstream metadata is sane and valid. If it is not sane and valid, and pulp publishes it anyway, our users will rightfully blame pulp for publishing invalid or broken repositories.
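
For readers unfamiliar with that approach, a much simplified sketch follows. It is not Pulp's implementation; `SnippetStore` and `nevra_key` are hypothetical. The key layout mirrors the one visible in the first traceback on this issue (later versions appear to use a different key), and the snippet content is invented.

```python
def nevra_key(name, epoch, version, release, arch):
    """Build a lookup key shaped like the one in the first traceback."""
    fields = {"name": name, "epoch": epoch, "version": version,
              "release": release, "arch": arch}
    return "::".join("%s:%s" % (k, fields[k]) for k in sorted(fields))


class SnippetStore(object):
    """Keeps the raw upstream <package> XML per rpm and re-joins it later."""

    def __init__(self):
        self._snippets = {}

    def save(self, key, raw_xml):
        self._snippets[key] = raw_xml

    def assemble(self, keys):
        # Report every missing entry at once instead of a bare KeyError.
        missing = [k for k in keys if k not in self._snippets]
        if missing:
            raise LookupError(
                "no upstream XML stored for: %s" % ", ".join(missing))
        body = "".join(self._snippets[k] for k in keys)
        return ('<filelists xmlns="http://linux.duke.edu/metadata/filelists" '
                'packages="%d">%s</filelists>' % (len(keys), body))


key = nevra_key("gitlab-ce", "0", "7.10.1~omnibus.2", "1", "x86_64")
store = SnippetStore()
store.save(key, '<package pkgid="x" name="gitlab-ce" arch="x86_64"/>')
print(store.assemble([key]))
```

The snippets are concatenated without being parsed again, which is where the time saving comes from, and also why invalid upstream XML would be republished unchanged.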

If you want to present metadata that is incomplete, but theoretically valid (this is where it stinks that there is no schema doc to consult), I'm ok with letting pulp use that. Ensuring that each package is listed seems like a good improvement. Yum, dnf, and other clients can reasonably assume that given an rpm's nevra as found in primary.xml, it will be present in the filelist. I think it is much less likely that a client will assume a specific file to be present in the list for a specific rpm. This greatly diminishes the chance that a user of yum/dnf/etc will experience a problem.

This should be fairly easy to test. If you would like any help with that, I suggest asking on our email list.

Thanks!

#16 Updated by ipanova@redhat.com about 5 years ago

  • Assignee deleted (ipanova@redhat.com)

#17 Updated by ale about 5 years ago

Hi everyone,

Is there any update regarding this issue? I am running pulp 2.8.0 and it is still happening when trying to sync the gitlab-ce omnibus repo from packagecloud.io.

Is there any real possibility to "fix" it?

Thanks

#19 Updated by cristi.falcas@gmail.com almost 5 years ago

We have the same problem: can't import gitlab repos.

This is the error:

mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) Listener error on event: succeeded
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) Traceback (most recent call last):
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) File "/usr/lib/python2.7/site-packages/pulp/server/content/sources/event.py", line 39, in call
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) listener.on_event(self)
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) File "/usr/lib/python2.7/site-packages/pulp/server/content/sources/event.py", line 135, in on_event
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) method(event.request)
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/importers/yum/repomd/alternate.py", line 119, in on_succeeded
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) self.content_listener.download_succeeded(report)
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/importers/yum/listener.py", line 201, in download_succeeded
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) self.sync.add_rpm_unit(self.metadata_files, unit)
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/importers/yum/sync.py", line 589, in add_rpm_unit
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) metadata_files.add_repodata(unit)
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/importers/yum/repomd/metadata.py", line 338, in add_repodata
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) raw_xml = db_file[db_key]
mai 17 06:43:43 v-so-repo-01.company.net pulp[19775]: pulp.server.content.sources.event:ERROR: (19775-82912) KeyError: '\xff\xff\xff\xff\xd0\xc2\xce\x04'

#20 Updated by ale almost 5 years ago

Hi there,

I emailed both parties. Because of the way the GitLab repos are managed, you cannot use pulp with them. Just download the omnibus package and install it locally.

cheers

#21 Updated by ipanova@redhat.com almost 5 years ago

  • Checklist item we need to fail gracefully when there is data missing in filelists added

#22 Updated by ipanova@redhat.com almost 5 years ago

  • Has duplicate Issue #2003: Pulp will not sync RPM files hosted using "packagecloud". added

#25 Updated by ttereshc over 4 years ago

  • Related to Issue #1932: More gracefully handle KeyError when an rpm has no data in filelists.xml nor in other.xml added

#26 Updated by mhrivnak over 4 years ago

  • Subject changed from Repo sync failing with KeyError - Possibly due to tilde? to Repo sync failing with KeyError

#27 Updated by mhrivnak over 4 years ago

  • Related to deleted (Issue #1932: More gracefully handle KeyError when an rpm has no data in filelists.xml nor in other.xml)

#28 Updated by mhrivnak over 4 years ago

  • Has duplicate Issue #1932: More gracefully handle KeyError when an rpm has no data in filelists.xml nor in other.xml added

#29 Updated by ipanova@redhat.com over 4 years ago

  • Sprint/Milestone set to 31

#30 Updated by ttereshc over 4 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to ttereshc

#31 Updated by ttereshc over 4 years ago

  • Status changed from ASSIGNED to POST

#32 Updated by ttereshc over 4 years ago

  • Checklist item we need to fail gracefully when there is data missing in filelists set to Done

#33 Updated by ttereshc over 4 years ago

  • Status changed from POST to MODIFIED

#35 Updated by semyers over 4 years ago

  • Platform Release set to 2.11.1

#36 Updated by semyers over 4 years ago

  • Status changed from MODIFIED to 5

#38 Updated by semyers over 4 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE

#39 Updated by bmbouter about 3 years ago

  • Sprint set to Sprint 13

#40 Updated by bmbouter about 3 years ago

  • Sprint/Milestone deleted (31)

#41 Updated by bmbouter about 2 years ago

  • Tags Pulp 2 added
