Issue #7141

lazy sync does not properly handle upstream repos with duplicate content but different repo layouts

Added by over 1 year ago. Updated over 1 year ago.

Start date:
Due date:
Estimated time:
2. Medium
Platform Release:
Sprint Candidate:
Pulp 2
Sprint 77


Say you have two repos that contain the same rpm, but at different paths:

os /Packages/f/foo.rpm

ks /Packages/foo.rpm

Now you sync them both using 'on_demand', but let's say the os repo gets the unit imported first. The rpm unit gets created with a relativepath of:


and then a lazy_content_catalog entry gets created with a url of:

This is all correct. Now the unit gets processed for the ks repo. It correctly reuses the same unit, but then creates a 2nd lazy_content_catalog entry with a url of:

It's using the relativepath of the rpm unit to build the lazy_content_catalog entry's url attribute. In reality this looks like:

> db.lazy_content_catalog.find({"path": {$regex: '.*libXxf86vm\-devel\-1\.1\.4\-9\.el8\.i686\.rpm'}})
{ "_id" : ObjectId("5f07ee48cc531034cce38acc"), "_ns" : "lazy_content_catalog", "path" : "/var/lib/pulp/content/units/rpm/8a/cd9d02545dff8fab381aaa6185a778a26cacbec1585bcd8f7b2f6509f254a2/libXxf86vm-devel-1.1.4-9.el8.i686.rpm", "importer_id" : "5f07ed47cc53103b7b1f02c9", "unit_id" : "305ec066-9d0f-46a7-a198-6b966218a40e", "unit_type_id" : "rpm", "url" : "", "checksum" : "e375334723b40b39a407d243d1dab859a6edf1b2b383faa68c257c1afb399e2f", "checksum_algorithm" : "sha256", "revision" : 1, "data" : {  } }
{ "_id" : ObjectId("5f07ef17cc531034b8afd793"), "_ns" : "lazy_content_catalog", "path" : "/var/lib/pulp/content/units/rpm/8a/cd9d02545dff8fab381aaa6185a778a26cacbec1585bcd8f7b2f6509f254a2/libXxf86vm-devel-1.1.4-9.el8.i686.rpm", "importer_id" : "5f07ed0dcc53103b7b1f02b5", "unit_id" : "305ec066-9d0f-46a7-a198-6b966218a40e", "unit_type_id" : "rpm", "url" : "", "checksum" : "e375334723b40b39a407d243d1dab859a6edf1b2b383faa68c257c1afb399e2f", "checksum_algorithm" : "sha256", "revision" : 1, "data" : {  } }
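The suspected mechanism described above can be sketched in Python. This only illustrates the reported behavior and is not Pulp's actual code: the feed URLs, the `build_catalog_url` helper and the dictionaries are all hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical feed URLs; the real ones are not shown in this report.
OS_FEED = "https://example.com/os/"
KS_FEED = "https://example.com/ks/"


def build_catalog_url(feed, relative_path):
    """Join a repo feed URL with a package path relative to that repo."""
    return urljoin(feed, relative_path.lstrip("/"))


# Path of the same rpm inside each upstream repo.
repo_paths = {OS_FEED: "Packages/f/foo.rpm", KS_FEED: "Packages/foo.rpm"}

# The shared unit keeps the path from whichever repo imported it first (os here).
unit_relativepath = repo_paths[OS_FEED]

# Suspected behavior: every catalog entry reuses the unit's stored path.
suspected = {feed: build_catalog_url(feed, unit_relativepath) for feed in repo_paths}

# Expected behavior: each catalog entry uses the path from its own repo.
expected = {feed: build_catalog_url(feed, path) for feed, path in repo_paths.items()}
```

Under these assumptions the os entry is identical in both cases, while the suspected ks entry points at a path that only exists in the os layout.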

Directions to reproduce:

  1. Sync the rhel 8 base os repo using on_demand
  2. Sync the rhel 8 kickstart repo using on_demand

  3. Attempt to fetch each rpm from the kickstart repo or base os repo (maybe a random assortment of each)

Results: you will get a lot of 404s from the streamer app:

Jul 13 17:19:35 dhcp-8-30-46 pulp_streamer: pulp.streamer.server:INFO: Download failed [404]:

This is because it's using the wrong relative path when fetching rpms from the kickstart repo. It's non-deterministic as to which lazy_content_catalog entry it will pick, so some requests will get a 404 and some won't. Retrying the download of an rpm may result in it working.
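A minimal simulation of why the failures would look intermittent, assuming (as this report describes) that one of several catalog entries for the same stored path is picked arbitrarily. The upstream URL set, the entries and the `fetch` helper are hypothetical stand-ins, not the streamer's real lookup.

```python
import random

# Hypothetical upstream: the URLs that actually exist on the remote side.
UPSTREAM = {
    "https://example.com/os/Packages/f/foo.rpm",
    "https://example.com/ks/Packages/foo.rpm",
}

# Two catalog entries for the same stored unit path; the second URL is the
# wrongly built one (os layout applied to the ks feed).
ENTRIES = [
    {"importer": "os", "url": "https://example.com/os/Packages/f/foo.rpm"},
    {"importer": "ks", "url": "https://example.com/ks/Packages/f/foo.rpm"},
]


def status_for(entry):
    """Return the HTTP status a download of this entry's URL would produce."""
    return 200 if entry["url"] in UPSTREAM else 404


def fetch(entries, rng):
    """Pick one entry arbitrarily, standing in for an unordered lookup by path."""
    return status_for(rng.choice(entries))


rng = random.Random(0)
statuses = [fetch(ENTRIES, rng) for _ in range(10)]
```

Each request either succeeds or returns a 404 depending on which entry is picked, which would match the observation that a retry can succeed.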


#1 Updated by over 1 year ago

  • Description updated (diff)

#3 Updated by rchan over 1 year ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to
  • Sprint set to Sprint 77

dkliban says: investigated how to fix ^ and got a patch working - will make a PR tomorrow - we will also need to write a script for users to clean up their DB and fix existing systems.

#4 Updated by ttereshc over 1 year ago

  • Triaged changed from No to Yes

#5 Updated by pulpbot over 1 year ago

  • Status changed from ASSIGNED to POST

#6 Updated by over 1 year ago

  • Status changed from POST to CLOSED - WORKSFORME

Even though I said that this bug exists in Pulp, it was only based on my initial reading of the code. I've tried to reproduce the bug and I was not able to. Lazy Catalog Entries are being created correctly for each repository layout. After reading the code again, the correct behavior makes sense. The problem experienced by Katello users is most likely related to the fact that repositories created by Katello use custom importers and/or distributors. I'll help investigate it from the Katello side.
