Issue #7141
closed
lazy sync does not properly handle upstream repos with duplicate content but different repo layouts
Description
Say you have two repos that contain the same rpm, but at different paths:
os /Packages/f/foo.rpm
ks /Packages/foo.rpm
Now you sync them both using 'on_demand', but let's say the os repo gets the unit imported first. The rpm unit gets created with a relativepath of:
/Packages/f/foo.rpm
and then a lazy_content_catalog entry gets created with a url of: https://server.example.com/os//Packages/f/foo.rpm
This is all correct. Now the unit gets processed for the ks repo. It correctly reuses the same unit, but then creates a second lazy_content_catalog entry with a url of: https://server.example.com/ks/Packages/f/foo.rpm
It's using the relativepath of the rpm unit to build the lazy_content_catalog entry's url attribute, rather than the path the rpm actually has in the ks repo (/Packages/foo.rpm). In reality this looks like:
> db.lazy_content_catalog.find({"path": {$regex: '.*libXxf86vm\-devel\-1\.1\.4\-9\.el8\.i686\.rpm'}})
{ "_id" : ObjectId("5f07ee48cc531034cce38acc"), "_ns" : "lazy_content_catalog", "path" : "/var/lib/pulp/content/units/rpm/8a/cd9d02545dff8fab381aaa6185a778a26cacbec1585bcd8f7b2f6509f254a2/libXxf86vm-devel-1.1.4-9.el8.i686.rpm", "importer_id" : "5f07ed47cc53103b7b1f02c9", "unit_id" : "305ec066-9d0f-46a7-a198-6b966218a40e", "unit_type_id" : "rpm", "url" : "https://cdn.redhat.com/content/dist/rhel8/8.2/x86_64/appstream/kickstart/Packages/libXxf86vm-devel-1.1.4-9.el8.i686.rpm", "checksum" : "e375334723b40b39a407d243d1dab859a6edf1b2b383faa68c257c1afb399e2f", "checksum_algorithm" : "sha256", "revision" : 1, "data" : { } }
{ "_id" : ObjectId("5f07ef17cc531034b8afd793"), "_ns" : "lazy_content_catalog", "path" : "/var/lib/pulp/content/units/rpm/8a/cd9d02545dff8fab381aaa6185a778a26cacbec1585bcd8f7b2f6509f254a2/libXxf86vm-devel-1.1.4-9.el8.i686.rpm", "importer_id" : "5f07ed0dcc53103b7b1f02b5", "unit_id" : "305ec066-9d0f-46a7-a198-6b966218a40e", "unit_type_id" : "rpm", "url" : "https://cdn.redhat.com/content/dist/rhel8/8/x86_64/appstream/os/Packages/libXxf86vm-devel-1.1.4-9.el8.i686.rpm", "checksum" : "e375334723b40b39a407d243d1dab859a6edf1b2b383faa68c257c1afb399e2f", "checksum_algorithm" : "sha256", "revision" : 1, "data" : { } }
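The behaviour described above can be sketched in Python. This is only an illustration under stated assumptions, not Pulp's actual code: the helper names (join_url, catalog_url_from_unit, catalog_url_from_repo) and the dict-based unit are hypothetical, and the feed URL is the example one from this report. It contrasts building the catalog url from the shared unit's relativepath (the suspected bug) with building it from the path the repo itself advertises.

```python
def join_url(feed, relative_path):
    """Join a feed URL and a relative path with exactly one slash between them."""
    return feed.rstrip("/") + "/" + relative_path.lstrip("/")


def catalog_url_from_unit(feed, unit):
    # Suspected behaviour: reuse the relativepath stored on the shared unit,
    # which reflects the layout of whichever repo imported the unit first.
    return join_url(feed, unit["relativepath"])


def catalog_url_from_repo(feed, repo_location):
    # Expected behaviour: use the location listed in this repo's own metadata.
    return join_url(feed, repo_location)


# The unit was created by the os repo sync, so it carries the os layout.
unit = {"name": "foo", "relativepath": "/Packages/f/foo.rpm"}
ks_feed = "https://server.example.com/ks/"

print(catalog_url_from_unit(ks_feed, unit))                # path that does not exist in ks
print(catalog_url_from_repo(ks_feed, "/Packages/foo.rpm"))  # path that ks actually serves
```

The first url points at a path that only exists in the os layout, which is the kind of entry that would produce a 404 when requested from the ks feed.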
Directions to reproduce:
- Sync the rhel 8 base os repo using on_demand
- Sync the rhel 8 kickstart repo using on_demand
- Attempt to fetch each rpm from the kickstart repo or base os repo (maybe a random assortment of each)
Results: you will get a lot of 404s from the streamer app:
Jul 13 17:19:35 dhcp-8-30-46 pulp_streamer: pulp.streamer.server:INFO: Download failed [404]: https://cdn.redhat.com/content/dist/rhel8/8/x86_64/appstream/os/Packages/texlive-luatex85-20180414-14.el8.noarch.rpm
This is because it's using the wrong relative path when fetching rpms from the kickstart repo. It's non-deterministic as to which lazy_content_catalog entry it will pick, so some requests will get a 404 and some won't. Retrying the download of an rpm may result in it working.
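To illustrate why the failures look intermittent, here is a small simulation. It is a sketch only: the streamer's real selection logic is not shown in this report, so a random choice stands in for "non-deterministic", and the catalog entries, the upstream set, and the fetch_status function are all hypothetical, built from the foo.rpm example above.

```python
import random

# Two catalog entries for the same unit, as in the foo.rpm example.
catalog = [
    {"unit_id": "u1", "url": "https://server.example.com/os/Packages/f/foo.rpm"},
    {"unit_id": "u1", "url": "https://server.example.com/ks/Packages/f/foo.rpm"},
]

# URLs that really exist upstream: the ks repo has no /Packages/f/ directory.
upstream = {
    "https://server.example.com/os/Packages/f/foo.rpm",
    "https://server.example.com/ks/Packages/foo.rpm",
}


def fetch_status(entries, unit_id, rng):
    """Pick one catalog entry for the unit at random and report the HTTP status."""
    candidates = [e for e in entries if e["unit_id"] == unit_id]
    entry = rng.choice(candidates)
    return 200 if entry["url"] in upstream else 404


rng = random.Random(0)
statuses = [fetch_status(catalog, "u1", rng) for _ in range(20)]
print(statuses)  # a mix of 200 and 404 depending on which entry was picked
```

Under this model a retry can succeed simply because a different entry was picked, which matches the observation that re-downloading sometimes works.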
Updated by rchan over 2 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
- Sprint set to Sprint 77
dkliban says: investigated how to fix ^ and got a patch working - will make a PR tomorrow - we will also need to write a script for users to clean up their DB and fix existing systems.
Updated by pulpbot over 2 years ago
- Status changed from ASSIGNED to POST
Updated by dkliban@redhat.com over 2 years ago
- Status changed from POST to CLOSED - WORKSFORME
Even though I said that this bug exists in Pulp, it was only based on my initial reading of the code. I've tried to reproduce the bug and I was not able to. Lazy Catalog Entries are being created correctly for each repository layout. After reading the code again, the correct behavior makes sense. The problem experienced by Katello users is most likely related to the fact that repositories created by Katello use custom importers and/or distributors. I'll help investigate it from the Katello side.