Story #6353

closed

As a user, I can mirror RPM repository content and metadata

Added by dkliban@redhat.com almost 5 years ago. Updated over 3 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Sprint/Milestone:
Start date:
Due date:
% Done: 0%
Estimated time:
Platform Release:
Groomed: No
Sprint Candidate: No
Tags: Katello
Sprint: Sprint 98
Quarter:

Description

Motivation

  • Clients installing packages from RPM mirrors hosted by Pulp don't have access to the original metadata provided by the remote repository.
  • Caching and/or load balancing becomes a problem if multiple instances of Pulp produce different metadata when syncing from the same remote repository.
  • If a repo contains duplicated content under different paths, it can't be synced at all unless the path is part of the content's natural key.

Proposed solution

Add ability to create repository versions that contain the original metadata from the remote repository.

This could be accomplished by the following:

  • Have a way to distinguish between repositories with managed content and with the exact mirror (e.g. create a repository with exact_mirror=True or a new dedicated repository type, RpmMirrorRepository)
  • For such repos, create a publication at sync time (includes published artifacts and metadata).
  • For such repos, publish is no-op and always returns the existing publication for the requested repo version.
  • For such repos, no modifications are allowed except the sync in mirror mode.
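
The behaviour listed above can be sketched in plain Python. This is only an illustration of the proposal, not pulpcore or pulp_rpm code: `MirrorRepository`, `Publication`, `sync_mirror`, `publish` and `modify` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Publication:
    repo_version: int
    metadata: tuple  # original metadata file names, stored untouched


@dataclass
class MirrorRepository:
    """Hypothetical stand-in for an exact-mirror repository."""
    name: str
    latest_version: int = 0
    publications: dict = field(default_factory=dict)

    def sync_mirror(self, remote_metadata):
        # Sync creates a new version and its publication in one step.
        self.latest_version += 1
        pub = Publication(self.latest_version, tuple(remote_metadata))
        self.publications[self.latest_version] = pub
        return pub

    def publish(self, version=None):
        # No-op: return the publication that sync already created.
        return self.publications[version or self.latest_version]

    def modify(self, *args, **kwargs):
        # Content changes other than a mirror sync are not allowed.
        raise PermissionError("mirror repositories can only be changed by sync")


repo = MirrorRepository("fedora-mirror")
created = repo.sync_mirror(["repomd.xml", "primary.xml.gz"])
assert repo.publish() is created
```

Raising from `modify` is one of the two options discussed; a dedicated repository type could instead simply not expose that operation.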

Pros

Cons

  • Doesn't solve the problem of various relative paths for the same content in a general way.
  • Requires a separate code path (at times) to handle this type of repository.

Related issues

Related to Pulp - Story #5200: Support 'mirrored' metadata (CLOSED - WONTFIX)

Related to Pulp - Story #8687: DeclarativeVersion API doesn't accomodate metadata mirroring use cases (CLOSED - DUPLICATE)

Related to RPM Support - Story #8673: Auto-publishing should be more fault-tolerant (CLOSED - DUPLICATE)

Related to Pulp - Story #8856: As a user, I have a convenient UX for mirroring repositories (CLOSED - DUPLICATE)

Blocked by Pulp - Story #7815: As a plugin writer, pulpcore ensures that a job working directory is set/removed properly (CLOSED - CURRENTRELEASE, dalley)

Actions #1

Updated by dkliban@redhat.com almost 5 years ago

  • Description updated (diff)
Actions #2

Updated by ttereshc almost 5 years ago

  • Sprint/Milestone set to Pulp 3.x RPM (Katello 3.16)
Actions #3

Updated by lmjachky over 4 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to lmjachky
Actions #4

Updated by rchan over 4 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (lmjachky)
  • Sprint/Milestone deleted (Pulp 3.x RPM (Katello 3.16))
Actions #5

Updated by rchan over 4 years ago

  • Sprint/Milestone set to Pulp 3.x RPM (Katello 3.16)
Actions #6

Updated by ttereshc over 4 years ago

  • Sprint/Milestone changed from Pulp 3.x RPM (Katello 3.16) to Priority items (outside of planned milestones/releases)
Actions #7

Updated by ttereshc over 4 years ago

  • Priority changed from Normal to High
Actions #8

Updated by ttereshc over 4 years ago

  • Sprint/Milestone changed from Priority items (outside of planned milestones/releases) to Pulp 3.x RPM (Katello 4.1)
Actions #9

Updated by jsherril@redhat.com over 4 years ago

The RPMDistribution will need to support users providing a repository or a repository version in addition to publications.

Ideally we wouldn't have to generate a normal yum publication when going this route, as those are quite expensive to generate.

Actions #10

Updated by dkliban@redhat.com over 4 years ago

jsherril@redhat.com wrote:

The RPMDistribution will need to support users providing a repository or a repository version in addition to publications.

Ideally we wouldn't have to generate a normal yum publication when going this route, as those are quite expensive to generate.

You would not need to create a publication. That's why we need to be able to serve the repository version directly.

Actions #11

Updated by ttereshc over 4 years ago

  • Priority changed from High to Normal
Actions #12

Updated by ttereshc over 4 years ago

  • Related to Story #5200: Support 'mirrored' metadata added
Actions #13

Updated by Anonymous over 4 years ago

This feature would be very welcome in RHUI as it would save us from regenerating the metadata every time the repo content is updated. So yes, it gets our votes!

Actions #14

Updated by ttereshc about 4 years ago

  • Description updated (diff)
Actions #15

Updated by dalley about 4 years ago

Open question: Should the DeclarativeContent pipeline be extended to allow this functionality, or should it remain entirely within the plugin?

The latter might make more sense for the initial implementation, but if Debian wants to switch to this method we might want to be able to share the implementation.

This is a separate question from the invasive generic proposal.

Actions #16

Updated by ipanova@redhat.com about 4 years ago

dalley wrote:

Open question: Should the DeclarativeContent pipeline be extended to allow this functionality, or should it remain entirely within the plugin?

The latter might make more sense for the initial implementation, but if Debian wants to switch to this method we might want to be able to share the implementation.

I would suggest keeping the changes entirely in the plugin for now. Both the RPM and Debian plugins have complex pipelines, so it would be good to first implement the proposal and then decouple what can be shared.

This is a separate question from the invasive generic proposal.

Actions #17

Updated by ipanova@redhat.com about 4 years ago

This could be accomplished by the following:

Have a way to distinguish between repositories with managed content and with the exact mirror (e.g. create a repository with exact_mirror=True or a new dedicated repository type, RpmMirrorRepository)

I think having a separate repo type will be a cleaner solution: we can disable endpoints we do not want to expose, for example the /modify endpoint, and also take control over which options to enable. I agree that this type of repo should be immutable, meaning no content can be added to it or removed from it.

For such repos, create a publication at sync time (includes published artifacts and metadata).

I wonder how we would leave room for the user to specify the signing_service and gpg_check options?

For such repos, publish is no-op and always returns the existing publication for the requested repo version.

Apparently in this step we could allow the user to re-publish the repo with the signing_service and gpg_check options if needed, but definitely not allow setting checksum_types.

For such repos, no modifications are allowed except the sync in mirror mode.

I guess we should not allow skipping types.

I am wondering: has it been considered to add metadata as a separate content type to the mirror repo type? This could allow us to distribute the repository right away without needing to create a publication. On the other hand, I would not know how we'd allow the user to set a gpg_check option, for example. This idea is obviously far from flawless; I'm just throwing it on the table for discussion.

Actions #18

Updated by ttereshc about 4 years ago

I wonder how we would leave room for the user to specify the signing_service and gpg_check options?

I'm not sure that we need to provide a way to sign repo metadata here. The idea is to have a pure mirror of the remote repo without any changes. But maybe I'm just not aware of a use case and customers will be interested in it. I'm open to feedback here.

I am wondering: has it been considered to add metadata as a separate content type to the mirror repo type? This could allow us to distribute the repository right away without needing to create a publication. On the other hand, I would not know how we'd allow the user to set a gpg_check option, for example. This idea is obviously far from flawless; I'm just throwing it on the table for discussion.

I believe it was one of the initial ideas. One reason the current proposal is different is that it's potentially one step closer to a more generalised solution for the relative_path problem, while a separate content type for metadata won't help in that case. Another potential concern is that this content type would need to be managed in a special way: it's not a content type you want to show to a user, and you need to disallow copy operations for it, i.e. never associate this content with any other repo, etc.

Actions #19

Updated by dalley almost 4 years ago

  • Blocked by Story #7815: As a plugin writer, pulpcore ensures that a job working directory is set/removed properly added
Actions #20

Updated by dalley almost 4 years ago

  • Sprint set to Sprint 90
  • Tags Katello added
Actions #21

Updated by rchan almost 4 years ago

  • Sprint changed from Sprint 90 to Sprint 91
Actions #22

Updated by rchan almost 4 years ago

  • Sprint changed from Sprint 91 to Sprint 92
Actions #23

Updated by rchan almost 4 years ago

  • Sprint changed from Sprint 92 to Sprint 93
Actions #24

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 93 to Sprint 94
Actions #25

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 94 to Sprint 95
Actions #26

Updated by dalley over 3 years ago

The initial proposal was to add some parameter to the repository, such as "mirror=True", and use this value to control how the remote behaves, or else to create an entire new model specifically about "mirroring".

I think we don't need such large changes to support this, and the less-invasive and lower-effort approach might actually provide better opportunities for improving Pulp workflows as a whole.

Instead of trying to implement this functionality at the data "model" layer, we should just provide a new viewset. We make no model changes whatsoever. We introduce a new RemoteRepositoryViewset which will provide a new primitive Pulp endpoint, /pulp/api/v3/remote_repositories/<plugin>/<type>/. This viewset will work with all repositories that have remote set, which is to say that the queryset would be Repository.objects.filter(remote__isnull=False).

This viewset would provide only one action endpoint, "refresh" (or "update"), which would perform a mirror-mode sync and publish with the internally-attached remote. The viewset would not provide the "modify" endpoint since mirrored repositories shouldn't be able to be changed, and it wouldn't provide the "sync" endpoint because there would not be any need to customize the sync in that way. This avoids needing to throw errors if a user attempts an operation they shouldn't do - because the new API provides no mechanism to do it - and also avoids creating a new model.
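
To make the idea concrete, here is a small plain-Python analogue of the queryset and of the restricted action set. It is a sketch only: the `Repository` dataclass, the two action sets and the helper functions are hypothetical and stand in for the Django model and viewsets, which are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Repository:
    name: str
    remote: Optional[str] = None  # URL of the attached remote, if any


# Actions each endpoint would expose, following the description above.
REPOSITORY_ACTIONS = {"sync", "modify"}
REMOTE_REPOSITORY_ACTIONS = {"refresh"}


def remote_repositories(repos):
    """Plain-Python analogue of Repository.objects.filter(remote__isnull=False)."""
    return [r for r in repos if r.remote is not None]


def allowed(action, endpoint_actions):
    # An action that is not exposed cannot be called, so no error path is needed.
    return action in endpoint_actions


repos = [
    Repository("custom-packages"),
    Repository("fedora-mirror", remote="https://example.com/fedora/"),
]
visible = [r.name for r in remote_repositories(repos)]
```

Only `fedora-mirror` is visible through the new endpoint, and "modify" is simply absent from its action set rather than rejected at call time.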

The old /repositories/<plugin>/<type>/ endpoint will remain exactly as it is currently, even at the code level, since all of the changes are in the new viewset. However we should discourage users from using this API with repositories they want to sync from external sources, and encourage using the new API instead, which has the benefit of being much easier to use for most if not all use cases.

The workflow:

  1. Create a "remote repository", specifying name + description + any remote options, which will transparently create both the remote and repository.
  2. "refresh" (or "update" or "sync") it - no parameters necessary because it uses the internal remote. syncs for most plugins would always "mirror"
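
The two steps could look roughly like this from the caller's side. Everything here is hypothetical and in-memory: `create_remote_repository`, `RemoteRepository.refresh` and `fake_fetch` are illustrative names, and no real Pulp API is being called.

```python
from dataclasses import dataclass, field


@dataclass
class Remote:
    url: str
    options: dict = field(default_factory=dict)


@dataclass
class RemoteRepository:
    name: str
    description: str
    remote: Remote
    versions: list = field(default_factory=list)

    def refresh(self, fetch):
        # The remote is already attached, so the caller passes no remote here.
        content = frozenset(fetch(self.remote.url))
        self.versions.append(content)  # mirror: the version equals upstream exactly
        return content


def create_remote_repository(name, description, url, **remote_options):
    # Step 1: one call creates the remote and the repository together.
    return RemoteRepository(name, description, Remote(url, remote_options))


def fake_fetch(url):
    # Stand-in for downloading and parsing upstream metadata.
    return ["bear-4.1.rpm", "cat-1.0.rpm"]


repo = create_remote_repository("zoo", "demo mirror", "https://example.com/zoo/")
latest = repo.refresh(fake_fetch)  # Step 2
```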

If a user just wants local mirrors, they're already done. For use cases where a user wants to modify or combine repos, we would encourage the "Katello workflow" of just copying all content from their remote repo into the new one.

Compared to syncing from many remotes into one repo, this workflow:

  • Is much more efficient (less network traffic, only parse metadata once, no need to query existing Content or Artifacts)
  • Is more deterministic / reliable (the remote URLs could change between syncs, or the syncs could accidentally use different mirrors)
Actions #27

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 95 to Sprint 96
Actions #28

Updated by bmbouter over 3 years ago

+1 to the plan outlined in comment 26. I think that takes us in the right direction and would be much more usable than having API endpoints which are available but throw errors when repositories don't have remotes, or when those that are only for mirroring must never use the /modify endpoint.

Actions #29

Updated by ggainey over 3 years ago

bmbouter wrote:

+1 to the plan outlined in comment 26. I think that takes us in the right direction and would be much more usable than having API endpoints which are available but throw errors when repositories don't have remotes, or when those that are only for mirroring must never use the /modify endpoint.

Concur with all of the above. While there may be some "what happens if a user does X" edge-cases to consider/close, this feels like a great approach!

Actions #30

Updated by ipanova@redhat.com over 3 years ago

dalley wrote:

The initial proposal was to add some parameter to the repository, such as "mirror=True", and use this value to control how the remote behaves, or else to create an entire new model specifically about "mirroring".

I think we don't need such large changes to support this, and the less-invasive and lower-effort approach might actually provide better opportunities for improving Pulp workflows as a whole.

Instead of trying to implement this functionality at the data "model" layer, we should just provide a new viewset. We make no model changes whatsoever. We introduce a new RemoteRepositoryViewset which will provide a new primitive Pulp endpoint, /pulp/api/v3/remote_repositories/<plugin>/<type>/. This viewset will work with all repositories that have remote set, which is to say that the queryset would be Repository.objects.filter(remote__isnull=False).

If other plugins would not take advantage of such workflows, and none is known to as of today, my suggestion would be to use a pulp-rpm endpoint, /pulp/api/v3/pulp_rpm/<mirrored-repos>/

My worry is that concepts like mirror/sync/remote repo are overused, and in plugins that don't have any repodata they might create more confusion than desired.

This viewset would provide only one action endpoint, "refresh" (or "update"), which would perform a mirror-mode sync and publish with the internally-attached remote. The viewset would not provide the "modify" endpoint since mirrored repositories shouldn't be able to be changed, and it wouldn't provide the "sync" endpoint because there would not be any need to customize the sync in that way. This avoids needing to throw errors if a user attempts an operation they shouldn't do - because the new API provides no mechanism to do it - and also avoids creating a new model.

The old /repositories/<plugin>/<type>/ endpoint will remain exactly as it is currently, even at the code level, since all of the changes are in the new viewset. However we should discourage users from using this API with repositories they want to sync from external sources, and encourage using the new API instead, which has the benefit of being much easier to use for most if not all use cases.

The workflow:

  1. Create a "remote repository", specifying name + description + any remote options, which will transparently create both the remote and repository.
  2. "refresh" (or "update" or "sync") it - no parameters necessary because it uses the internal remote. syncs for most plugins would always "mirror"

If choosing between refresh/update/sync, I would probably go with 'mirror' (or 'replicate'? too much, overkill?), even though we already have a sync mode option 'mirror', which does not alleviate the tautology.

If a user just wants local mirrors, they're already done. For use cases where a user wants to modify or combine repos, we would encourage the "Katello workflow" of just copying all content from their remote repo into the new one.

If one wants to get rid of one RPM, copying the whole repo is a major inconvenience.

If I understood correctly, whether I create a repo via the new endpoint or the old one, if it has a remote it will appear in both endpoints. I know what I am going to propose might be confusing, but instead of copying content the user could just use the old API. It will require us to document this very clearly, but why not leave both options to the user, who could switch between the new and old API based on their needs?

Compared to syncing from many remotes into one repo, this workflow:

  • Is much more efficient (less network traffic, only parse metadata once, no need to query existing Content or Artifacts)
  • Is more deterministic / reliable (the remote URLs could change between syncs, or the syncs could accidentally use different mirrors)
Actions #31

Updated by dalley over 3 years ago

  • Related to Story #8687: DeclarativeVersion API doesn't accomodate metadata mirroring use cases added
Actions #32

Updated by dalley over 3 years ago

If other plugins would not take advantage of such workflows, and none is known to as of today, my suggestion would be to use a pulp-rpm endpoint, /pulp/api/v3/pulp_rpm/<mirrored-repos>/

My worry is that concepts like mirror/sync/remote repo are overused, and in plugins that don't have any repodata they might create more confusion than desired.

This is definitely something that every plugin could take advantage of to make the workflow simpler, even if they don't use mirrored metadata. Instead of creating a remote + repo they would just create one remote_repository, and instead of having to keep track of the remote to pass it to /sync/ every time, they would just refresh (or whatever).

The fact that it fixes the metadata mirroring UX is a nice bonus.

If choosing between refresh/update/sync, I would probably go with 'mirror' (or 'replicate'? too much, overkill?), even though we already have a sync mode option 'mirror', which does not alleviate the tautology.

Yeah, we can have a nice long bikeshed discussion about the names later on :)

If one wants to get rid of one RPM, copying the whole repo is a major inconvenience.

If I understood correctly, whether I create a repo via the new endpoint or the old one, if it has a remote it will appear in both endpoints. I know what I am going to propose might be confusing, but instead of copying content the user could just use the old API. It will require us to document this very clearly, but why not leave both options to the user, who could switch between the new and old API based on their needs?

This is a legitimate concern and the suggestion is definitely something the user could do if they wanted to, but I don't think we should encourage it. We can almost certainly come up with better ways to address the problem.

If a user is doing this, they're already running the risk of having the next sync wipe out the changes (package removals) they made, so it's an error-prone workflow to begin with. It's also not easy to tell what you've changed vs. what changed remotely because the history is all mixed together. Whereas with the copy API you can use "base_version=... remove_content_units=[...]" to exclude your couple of content units, and you can see what changes you made.**

** This might not be true if the repo version's "added" and "removed" aren't based on the "base_version"; it's been a while since I checked. But in theory we could give the user more detailed information than they had previously with this workflow.
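
As a rough illustration of the semantics hoped for here, the sketch below treats repository versions as plain sets. The parameter names `base_version` and `remove_content_units` come from the comment above; `add_content_units`, the `modify` function and `changes_relative_to` are assumptions made for the example, and whether Pulp actually reports "added"/"removed" relative to the base version is exactly the open question in the footnote.

```python
def modify(base_version, add_content_units=(), remove_content_units=()):
    """Return the content set of the new repository version."""
    return (set(base_version) - set(remove_content_units)) | set(add_content_units)


def changes_relative_to(base_version, new_version):
    """What a user would want to see: their own edits, not upstream churn."""
    base, new = set(base_version), set(new_version)
    return {"added": sorted(new - base), "removed": sorted(base - new)}


upstream = {"bear-4.1.rpm", "cat-1.0.rpm", "dog-2.3.rpm"}
curated = modify(upstream, remove_content_units=["dog-2.3.rpm"])
diff = changes_relative_to(upstream, curated)
```

Because the curated repository is derived from a specific upstream version, the diff contains only the user's own removal and stays stable no matter what the next upstream sync brings.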

Actions #33

Updated by dalley over 3 years ago

  • Related to Story #8673: Auto-publishing should be more fault-tolerant added
Actions #34

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 96 to Sprint 97
Actions #35

Updated by dalley over 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley
Actions #36

Updated by dalley over 3 years ago

  • Status changed from ASSIGNED to POST

Added by dalley over 3 years ago

Revision 7f4eb514

mirror=True will perform "metadata mirroring"

A sync with mirror=True will automatically create a publication using the existing metadata downloaded from the original repo, keeping the repository signature intact.

re: #6353 https://pulp.plan.io/issues/6353
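
For readers who want to try this, a sync request with the flag set might be assembled as below. Only `mirror=True` comes from the commit message; the host name, the hrefs and the `/sync/` path layout are placeholders assumed from the usual Pulp 3 REST conventions and should be checked against the API docs of the installed pulp_rpm version. The sketch builds the request without sending it.

```python
import json

# Placeholder values; substitute the hrefs returned by your own Pulp instance.
BASE_URL = "https://pulp.example.com"
REPO_HREF = "/pulp/api/v3/repositories/rpm/rpm/<repo-uuid>/"
REMOTE_HREF = "/pulp/api/v3/remotes/rpm/rpm/<remote-uuid>/"


def build_sync_request(base_url, repo_href, remote_href, mirror=True):
    """Return (url, json_body) for a sync request; nothing is sent."""
    url = f"{base_url}{repo_href}sync/"
    body = json.dumps({"remote": remote_href, "mirror": mirror})
    return url, body


url, body = build_sync_request(BASE_URL, REPO_HREF, REMOTE_HREF)
print("POST", url)
print(body)
```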

Actions #38

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 97 to Sprint 98
Actions #39

Updated by dalley over 3 years ago

  • Status changed from POST to MODIFIED

Added by dalley over 3 years ago

Revision 8456d01a

Fix an oversight in mirrored metadata publishing

re: #6353

Actions #40

Updated by dalley over 3 years ago

  • Related to Story #8856: As a user, I have a convenient UX for mirroring repositories added
Actions #41

Updated by ggainey over 3 years ago

  • Sprint/Milestone changed from Pulp 3.x RPM (Katello 4.1) to 3.13.0
Actions #42

Updated by pulpbot over 3 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
