Story #5625: Typed Repositories


h2. Primary Problem Statement 

As discussed in #3541 [0], we need to ensure that each "RepositoryVersion" is a valid repository for each of the types of content it contains. This is a hard problem, and the solutions proposed so far involve calling a handler defined by each plugin, which would somehow need to be detected at creation time based on the content in the repo.

This would not be so complicated if it only had to be considered at sync time, because only one handler would need to run at a time (the one for the plugin being used presently). We cannot do that, however, because you can also create repository versions by manually adding and/or removing content. Invalid repositories should be impossible to create in that case as well.

 This problem can thus be summarized as: repository versions need to be validated for correctness at creation time no matter what method is being used to create them and what plugins/content types are involved. 

h2. Other Problems

Some features of Pulp do not mesh properly with generically-typed repositories, the clearest example being mirror=True. A sync that uses mirror=True creates a repository version that is an exact "mirror" of the repository pointed at by the remote used, including removing content that is not present. The unfortunate consequence is that, because this happens declaratively at the content level, "content that is not present" includes content from other plugins. For example, Python content is "not present" in an RPM repo, so Python content in the same repo would be wiped away when performing a mirror sync with an RPM remote (and this applies to any plugins, not just the two examples provided).

 This is unintuitive and would be an extremely sharp edge for a user to cut themselves on, but there is no obvious solution with the current architecture. 
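
The mirror behavior described above can be illustrated with plain set arithmetic. This is only a sketch: the @mirror_sync@ function and the @(plugin_type, name)@ tuples are hypothetical stand-ins, not Pulp code.

```python
# Hypothetical stand-in: each content unit is modeled as a (plugin_type, name) tuple.
def mirror_sync(current_content, remote_content):
    """Return (added, removed) so the new version exactly matches the remote."""
    added = remote_content - current_content
    removed = current_content - remote_content
    return added, removed


# A generic repository that mixes RPM and Python content.
repo_content = {("rpm", "bash"), ("python", "requests")}
# The RPM remote only knows about RPM content.
rpm_remote_content = {("rpm", "bash"), ("rpm", "zsh")}

added, removed = mirror_sync(repo_content, rpm_remote_content)
new_version = (repo_content - removed) | added
```

Here @removed@ contains the Python unit, because from the RPM remote's point of view it is simply "not present".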

Similarly, many features we might want to implement, such as "keep at most N versions of any package, discard older ones", might make sense when applied to RPM or Debian repositories but make much less sense when applied to e.g. a Python repository. So, a user mixing different content types in the same repository could end up with a sub-par experience down the line if/when they want to do more advanced things with Pulp, and we can't predict ahead of time what cross-plugin problems theoretical future features might cause. Disallowing this from the start might save users and ourselves a great deal of pain.

 A final note: In various interactions with the community when we have asked about multi-content type Repos, no users have expressed the desire to actually use this feature. It seems like most users would rather create e.g. one repository for RPM content, one for Docker content, one for Ansible content, etc. and of course that is how it will be used by most stakeholders e.g. Katello. So it doesn't seem that there is a great reason to maintain this feature (generic repositories) if it causes us problems. 

h2. Proposal

The simplest way to fix this is to make repositories type-specific. This would work because, if only one plugin ever interacts with the content in a given repository version, the validation that needs to be done and how it is triggered can be very straightforward.

  

 Example workflow: 

 <pre> 
 # Create a Repository 
 http POST :8000/pulp/api/v3/repositories/rpm/rpm/ name=$REPO_NAME 

 # Modify a Repository (create a new version by adding or removing content units) 
 # base_version optional, if not provided, latest version is used 
 http POST :8000/pulp/api/v3/repositories/rpm/rpm/1cb9d608-5d46-4569-8002-40ef8901905a/modify/ base_version=$BASE_VERSION_HREF add_content_hrefs:="[\"$HREF_1\", \"$HREF_2\"]" remove_content_hrefs:="[\"$HREF_3\"]"
 </pre> 

 In this proposal, each plugin would define at least one new model subclassing Repository, one new viewset subclassing RepositoryViewset, and one new serializer subclassing RepositorySerializer. The subclassed Repository model would have a list of the content types it creates so that stages in the Content pipeline have knowledge about what types they are dealing with. RepositoryVersions under this scheme will be created and validated by methods on the Repository class (and only through those means) at RepositoryVersion creation time, although the precise API for validating repository version contents will be decided in #3541 [0].  
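
A minimal sketch of the idea, using plain Python classes rather than Django models. Everything here is an assumption made for illustration: the @CONTENT_TYPES@ attribute, the @validate@ and @new_version@ method names, and the stand-in content classes are hypothetical, and the real validation API is still to be decided in #3541 [0].

```python
class Content:
    """Stand-in for a content unit (hypothetical, not the pulpcore model)."""

    def __init__(self, name):
        self.name = name


class RpmPackage(Content):
    pass


class PythonPackage(Content):
    pass


class Repository:
    """Stand-in base class; plugins subclass it and declare their types."""

    CONTENT_TYPES = ()  # hypothetical attribute name

    def __init__(self, name):
        self.name = name
        self.versions = [frozenset()]  # version 0 is empty

    def validate(self, content):
        """Reject any unit whose type the repository does not declare."""
        for unit in content:
            if not isinstance(unit, self.CONTENT_TYPES):
                raise ValueError(
                    f"{type(unit).__name__} is not allowed in {type(self).__name__}"
                )

    def new_version(self, add=(), remove=(), base_version=None):
        """The only way to create a version, so validation always runs."""
        base = self.versions[-1 if base_version is None else base_version]
        content = (base - frozenset(remove)) | frozenset(add)
        self.validate(content)
        self.versions.append(content)
        return len(self.versions) - 1


class RpmRepository(Repository):
    CONTENT_TYPES = (RpmPackage,)
```

Because every version goes through @new_version@, a sync and a manual add/remove share the same validation path, and a plugin can override @validate@ with stricter, type-aware checks.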

h2. Additional problems this could help solve / positive changes it enables

 The stages pipeline needs to make queries for content units, and it needs to know what types are used. So we currently need to get the type() of each and every content unit to build up the list of types. This might be able to be cleaned up if we know the possible types ahead of time, although the specifics will need to be worked out. 
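
As a rough sketch of the difference (not the actual stages code), compare discovering types by inspecting every unit with reading them from a declaration on the repository class. The class names and the @CONTENT_TYPES@ attribute are hypothetical.

```python
class Package:
    """Hypothetical content type."""

    def __init__(self, name):
        self.name = name


class Advisory:
    """Hypothetical content type."""

    def __init__(self, name):
        self.name = name


def discover_types(units):
    """Current approach: inspect every unit to learn which types are present."""
    return {type(unit) for unit in units}


class RpmRepository:
    # Hypothetical declaration of the types this repository can hold.
    CONTENT_TYPES = (Package, Advisory)


def batch_by_type(units, content_types):
    """With types known up front, one batch per declared type can be prepared."""
    batches = {content_type: [] for content_type in content_types}
    for unit in units:
        batches[type(unit)].append(unit)
    return batches
```

With a declared list, the pipeline could issue one query per known type without first walking the whole set of units, although the specifics would still need to be worked out.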

 With the current architecture, because a repository can contain multiple types of content we call the /remotes/plugin/type/uuid/sync/ endpoint and provide a repository for which a new version should be created. We have to do this because the sync code is defined outside of core while the repositories endpoint is defined inside of core and because we wanted to avoid "hooks" from core into plugins. This is a little weird from the user perspective. We're not really able to make it perfectly "RESTful" no matter what, but an action endpoint that syncs a repository makes more sense than syncing a remote. 

 If repositories only handle one plugin's content types, and if the plugin has control over the endpoints, then there is no longer any need for this arrangement. The sync API could look like this: 

 <pre> 
 # Sync a Repository (creates a new version using the given remote) 
 http POST :8000/pulp/api/v3/repositories/rpm/rpm/1cb9d608-5d46-4569-8002-40ef8901905a/sync/ remote=$REMOTE_HREF

 # Modify a Repository (create a new version by adding or removing content units) 
 # base_version optional, if not provided, latest version is used 
 http POST :8000/pulp/api/v3/repositories/rpm/rpm/1cb9d608-5d46-4569-8002-40ef8901905a/modify/ base_version=$BASE_VERSION_HREF add_content_hrefs:="[\"$HREF_1\", \"$HREF_2\"]" remove_content_hrefs:="[\"$HREF_3\"]"
 </pre> 

 Plugins could also add metadata to their repository models, such as RPM metadata gpg signing keys (I don't know if that would make more sense to put on a publisher, but it was suggested). 

 Lastly, since nearly all other Pulp objects are typed, there's some value to the consistency of making this one as well. 

h2. Optional additional API

With this change, there would be no way to list all repositories of all types. We should investigate whether GET /pulp/api/v3/repositories/ can return a list of all repository names and their types. This is not strictly necessary, but it would be nice to have, as it would fill a small gap created by this change.

 I put it here for discussion but we should address it as part of a separate task if we decide it should be done. 
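
For the sake of discussion, the aggregated listing could be shaped roughly like this. The function name, the type labels, and the response fields are all assumptions, not an agreed API.

```python
def list_all_repositories(repos_by_type):
    """Flatten {type_label: [repository names]} into one sorted list of entries."""
    entries = [
        {"name": name, "type": type_label}
        for type_label, names in repos_by_type.items()
        for name in names
    ]
    return sorted(entries, key=lambda entry: (entry["type"], entry["name"]))


# Hypothetical type labels and repository names.
listing = list_all_repositories(
    {"rpm.rpm": ["zoo", "base"], "python.python": ["internal"]}
)
```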

 [0] https://pulp.plan.io/issues/3541