Project

Profile

Help

Story #5625

Typed Repositories

Added by dalley about 1 month ago. Updated 23 days ago.

Status:
MODIFIED
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

0%

Platform Release:
Blocks Release:
Backwards Incompatible:
No
Groomed:
No
Sprint Candidate:
Yes
Tags:
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:

Description

Primary Problem Statement

As discussed in #3541 [0] we need to ensure that each "RepositoryVersion" is a valid repository for each of the types of content it contains. This is a hard problem, and the solutions proposed so far involve calling a handler defined by each plugin, that would need to be detected somehow at creation time based upon the content in the repo.

This would be not be so complicated if it only had to be considered at sync time because only one handler would need to run at a time (the one for the plugin being used presently), but we cannot do that, because you can create repository versions by manually adding and/or removing content. Invalid repositories should be impossible to create then, as well.

This problem can thus be summarized as: repository versions need to be validated for correctness at creation time no matter what method is being used to create them and what plugins/content types are involved.

Other Problems

Some features of Pulp do not mesh properly with generically-typed repositories, with the most clear example being mirror=True. A sync that uses mirror=True creates a repository version that is an exact "mirror" of the repository pointed at by the remote used, including removing content that is not present. The unfortunate consequence of this is that because this happens declaratively at the content level, "content that is not present" includes content from other plugins, e.g. Python content is "not present" in an RPM repo, so Python content in the same repo would be wiped away if performing a mirror sync with an RPM remote (and this applies with any plugins, not just the two examples provided).

This is unintuitive and would be an extremely sharp edge for a user to cut themselves on, but there is no obvious solution with the current architecture.

Similarly, a lot of features we might want to implement such as "keep at most N versions of any package, discard older ones" might make sense when appied to RPM or Debian repositories but make much less sense when applied to e.g. a Python repository. So, a user mixing different content types into the same repository could lead to a sub-par experience down the line if/when they wanted to do more advanced things with Pulp, and we can't predict ahead of time what cross-plugin problems theoretical future features might cause. Disallowing this from the start might save users and ourselves a great deal of pain.

A final note: In various interactions with the community when we have asked about multi-content type Repos, no users have expressed the desire to actually use this feature. It seems like most users would rather create e.g. one repository for RPM content, one for Docker content, one for Ansible content, etc. and of course that is how it will be used by most stakeholders e.g. Katello. So it doesn't seem that there is a great reason to maintain this feature (generic repositories) if it causes us problems.

Proposal

The simplest way to fix this is to make repositories type-specific. This would work because if only one plugin ever interacts with the content in a given repository version, the validation that needs to be done and how it is triggered can be very straightforwards.

Example workflow:

# Create a Repository
http POST :8000/pulp/api/v3/repositories/rpm/rpm/ name=$REPO_NAME

# Modify a Repository (create a new version by adding or removing content units)
# base_version optional, if not provided, latest version is used
http POST :8000/pulp/api/v3/repositories/rpm/rpm/1cb9d608-5d46-4569-8002-40ef8901905a/modify/ base_version=$BASE_VERSION_HREF add_content_hrefs=[$HREF_1, $HREF_2] remove_content_hrefs=[$HREF_3]

In this proposal, each plugin would define at least one new model subclassing Repository, one new viewset subclassing RepositoryViewset, and one new serializer subclassing RepositorySerializer. RepositoryVersions under this scheme will be created and validated by methods on the Repository class (and only through those means) at RepositoryVersion creation time, although the precise API for validating repository version contents will be decided in #3541 [0].

Additional problems this could help solve / positive changes it enables

The stages pipeline needs to make queries for content units, and it needs to know what types are used. So we currently need to get the type() of each and every content unit to build up the list of types. This might be able to be cleaned up if we know the possible types ahead of time, although the specifics will need to be worked out.

With the current architecture, because a repository can contain multiple types of content we call the /remotes/plugin/type/uuid/sync/ endpoint and provide a repository for which a new version should be created. We have to do this because the sync code is defined outside of core while the repositories endpoint is defined inside of core and because we wanted to avoid "hooks" from core into plugins. This is a little weird from the user perspective. We're not really able to make it perfectly "RESTful" no matter what, but an action endpoint that syncs a repository makes more sense than syncing a remote.

If repositories only handle one plugin's content types, and if the plugin has control over the endpoints, then there is no longer any need for this arrangement. The sync API could look like this:

# Sync a Repository (creates a new version using the given remote)
http POST :8000/pulp/api/v3/repositories/rpm/rpm/1cb9d608-5d46-4569-8002-40ef8901905a/sync/ remote=$REMOTE_HREF

Plugins could also add metadata to their repository models, such as RPM metadata gpg signing keys (I don't know if that would make more sense to put on a publisher, but it was suggested).

Lastly, since nearly all other Pulp objects are typed, there's some value to the consistency of making this one as well.

Optional additional API

With this change, there would be no way to list all repositories of all types. We should investigate whether we can have a GET /pulp/api/v3/repositories/ return a list of all repository names and their type. This is not very necessary but it would be nice to have as it would fill a small gap created by this change.

I put it here for discussion but we should address it as part of a separate task if we decide it should be done.

[0] https://pulp.plan.io/issues/3541


Related issues

Blocks Pulp - Issue #3541: Core should not add/remove content to a repository or create a repository_version without plugin input MODIFIED Actions
Blocks Pulp - Task #5627: Remove plugin managed repos MODIFIED Actions

Associated revisions

Revision 7440aea8 View on GitHub
Added by dalley 29 days ago

Typed Repositories and requirements files split

re: #5625
https://pulp.plan.io/issues/5625

Revision c65d1581 View on GitHub
Added by dalley 26 days ago

Nested views: Change from generic subclass views to specific views

Required PR: https://github.com/pulp/pulp_file/pull/299
Required PR: https://github.com/PulpQE/pulp-smash/pull/1223

re: #5625
https://pulp.plan.io/issues/5625

History

#1 Updated by dalley about 1 month ago

  • Blocks Issue #3541: Core should not add/remove content to a repository or create a repository_version without plugin input added

#2 Updated by gmbnomis about 1 month ago

This is great news!

I think the sync/modify URLs would include the typed repo:

http POST :8000/pulp/api/v3/repositories/rpm/rpm/1cb9d608-5d46-4569-8002-40ef8901905a/sync/ remote=$REMOTE_HREF

and

http POST :8000/pulp/api/v3/repositories/rpm/rpm/1cb9d608-5d46-4569-8002-40ef8901905a/modify/ base_version=$BASE_VERSION_HREF add_content_hrefs=[$HREF_1, $HREF_2] remove_content_hrefs=[$HREF_3]

The actual version endpoint would be /pulp/api/v3/repositories/rpm/rpm/1cb9d608-5d46-4569-8002-40ef8901905a/versions/ right?
I am wondering whether the sync / modify actions should be below versions (since POSTs to these URLs create new versions, not repos).

#3 Updated by dalley about 1 month ago

  • Description updated (diff)

@gmbnomis, you're correct, I forgot to fix that up when I was writing this on the plane :)

#4 Updated by daviddavis about 1 month ago

This looks great. Would the RepositoryViewset just define CRUD and all the other actions (e.g. sync, modify, etc) would be defined in the plugins?

We should investigate whether we can have a GET /pulp/api/v3/repositories/ return a list of all repository names and their type.

I would say that we should at least postpone this until after 3.0 GA or until a user asks for this feature. I could imagine that some plugins would want to have hidden repos that they don't want to be displayed to the user (eg kickstart repos). Or also, plugins might want to control how the repos are displayed to the user.

#5 Updated by dalley about 1 month ago

@David, yeah. We could also have the core class define the endpoints too, but have them call straight to methods that the plugin writer has to implement, which would reduce a little bit of boilerplate. Not sure if that's a good idea since I think it would make customizing the endpoints a little more difficult, but it would reduce boilerplate.

Another thing we need to figure out -- the remote will need to be type-checked.

#6 Updated by daviddavis about 1 month ago

  • Blocks Task #5627: Remove plugin managed repos added

#8 Updated by bmbouter about 1 month ago

@dalley and I have been collaborating on this. Here's a recap of the challenges we have run into in bringing this change to an acceptable conclusion. tl;dr the challenges are outweighing the benefits to users and plugin writers.

The specific issues are all derivatives of a general problem: we wanted to move the Repository Model, its Viewset, and its Serializer to be plugin owned, but we don't want to move RepositoryVersion Model, RepositoryVersionViewset, RepositoryVersionSerializer to the plugins.

Why don't we want to move the RepositoryVersion also? In short, because it's creating more burden on the plugin writer without providing a benefit to either them or the plugin writer or the user. For example, we specifically don't want to RepositoryVersion to have Master/Detail tables or be customizable. Users benefit from having RepositoryVersion all be the same.

Here are the specific issue preventing us from continuing down this prototype:

1) RepositoryVersion's is one viewset, with one name, but we need to give it a name per type-repository. @dalley and I prototyped a workaround in this area, but I don't think either of us were comfortable accepting it, although it did work.

2) There are various other core machinery that needs to refer to these per type-repository and if RepositoryVersion has many names you end up having to split those objects up into one for each type.

For some things like RepositoryVersionSerializer the name here could be moved into the plugins but this would have the downsides stated above (we didn't want Master/Detail RepositoryVersions, just Repositories)

In other cases though, it's extremely challenging, specifically two:

https://github.com/pulp/pulpcore/blob/342f4539c09a7accda7d2ffaaea7666b1dac0463/pulpcore/app/serializers/fields.py#L187
https://github.com/pulp/pulpcore/blob/2a62d16b537757a951d41cfafa32c5a5a2d63c52/pulpcore/app/serializers/publication.py#L22

To move ^ also to plugins would be akin to "gutting core" which is a non-starter when asking anyone.

#9 Updated by bmbouter about 1 month ago

Could we still adopt typed repositories, but with an alternate design?

tl;dr not in a way that doesn't significantly and negatively affect users and/or plugin writers

We considered two alternatives.

1) Move All The Things to plugins and have more Master/Detail objects them maintain. We determined this is a non starter due to concerns described here

2) Unnest RepositoryVersion from underneath Repository.

This would have a typed Master/Detail repository still, e.g. pulp_file at /pulp/api/v3/repositories/file/file/1cb9d608-5d46-4569-8002-40ef8901905a/ and /pulp/api/v3/repositories/file/file/ but /pulp/api/v3/repositories/file/file/1cb9d608-5d46-4569-8002-40ef8901905a/versions/ would no longer be underneath. Instead all versions of all repositories (all types) would be available at /pulp/api/v3/repository_verisons/.

You would then query for all RepositoryVersions using /pulp/api/v3/repository_versions/.
You would query for all RepositoryVersions for a single Repository using /pulp/api/v3/repository_versions/ repository=/pulp/api/v3/repositories/file/file/1cb9d608-5d46-4569-8002-40ef8901905a/. This url could be provided for the user in the Repository serializer versions_href also to ease their construction of it.

We determined we don't want to do this for usability of reasons below.

1) Loosing the the compositional nature of /versions/ under the repositories endpoint is a loss for users.
2) RepositoryVersions from different repositories mixed together doesn't conceptually make sense. For example you'd have many version=1 repos there.

What do we loose by not doing this?

The original motivation was to accomplish https://pulp.plan.io/issues/3541 which allowed plugin writers to provide content modification/validation along with typed repositories. We're going to resume the discussion of design solutions there instead and close this as WONTFIX.

More feedback welcome

If this decision is not the best, or there is something we haven't considered please post on it and we could reopen it.

#10 Updated by gmbnomis about 1 month ago

bmbouter wrote:

2) There are various other core machinery that needs to refer to these per type-repository and if RepositoryVersion has many names you end up having to split those objects up into one for each type.

For some things like RepositoryVersionSerializer the name here could be moved into the plugins but this would have the downsides stated above (we didn't want Master/Detail RepositoryVersions, just Repositories)

In other cases though, it's extremely challenging, specifically two:

https://github.com/pulp/pulpcore/blob/342f4539c09a7accda7d2ffaaea7666b1dac0463/pulpcore/app/serializers/fields.py#L187
https://github.com/pulp/pulpcore/blob/2a62d16b537757a951d41cfafa32c5a5a2d63c52/pulpcore/app/serializers/publication.py#L22

To move ^ also to plugins would be akin to "gutting core" which is a non-starter when asking anyone.

This are all problems within serialization, right? Could we implement serializer fields that essentially do this https://github.com/pulp/pulpcore/blob/9b2ce1d711ed7b0833bc8932040521eecab886f4/pulpcore/app/models/repository.py#L694-L698 to compute the href? This is not beautiful, but currently I don't see why it should not work. (after all, we already have a custom serializer field for the "latest" href).

Or is this only part of the problem?

#11 Updated by bmbouter about 1 month ago

gmbnomis wrote:

This are all problems within serialization, right? Could we implement serializer fields that essentially do this https://github.com/pulp/pulpcore/blob/9b2ce1d711ed7b0833bc8932040521eecab886f4/pulpcore/app/models/repository.py#L694-L698 to compute the href? This is not beautiful, but currently I don't see why it should not work. (after all, we already have a custom serializer field for the "latest" href).

Or is this only part of the problem?

This is more or less what @dalley and I were looking into doing. It didn't come easy though. I agree with you, we probably could make it work with more time, but the risk/reward is the issue for me.

1) The "not beautiful" solution you mention is another on top of several already "not beautiful" hacks solutions @dalley and I already had to put in place. Which leads me to the idea that...

2) DRF wasn't meant to be used this way. I don't have good evidence for this, but with each workaround we put in place Pulp's code takes more responsibility and DRF takes less. This makes me uncomfortable.

3) These things will appear magical for plugin writers. For example we would have to use runtime generated names which obfuscates the functionality to plugin writers. I want to Pulp to be very unsurprising for plugin writers. I feel this design would take us further from the principle of least surprise.

If the benefit was very large, it would offset these concerns (for me). I don't see the benefit to be that large other than what got us into the discussion with #3541. In fact, I think having repositories spread out across the API makes for a not-as-great user experience which is an additional usability concern I have.

After many days of effort from several folks, this path feels like one challenge after the next. If someone showed me fully working PRs I might feel differently.

@gmbnomis, given all ^ what do you think we should do?

#12 Updated by gmbnomis about 1 month ago

bmbouter wrote:

After many days of effort from several folks, this path feels like one challenge after the next. If someone showed me fully working PRs I might feel differently.

Challenge accepted. https://github.com/gmbnomis/pulpcore/tree/typed-repositories is a PoC that looks promising (on top of @dalley latest public branch). This isn't a PR and it is not really tested. I did some manual verification. The serializers do not depend on unique view names for the RepositoryVersion URLs.

This is just the first pass, the code needs beautification in some places. And of course tests and docs.

gmbnomis wrote:

This are all problems within serialization, right? Could we implement serializer fields that essentially do this https://github.com/pulp/pulpcore/blob/9b2ce1d711ed7b0833bc8932040521eecab886f4/pulpcore/app/models/repository.py#L694-L698 to compute the href? This is not beautiful, but currently I don't see why it should not work. (after all, we already have a custom serializer field for the "latest" href).

Or is this only part of the problem?

This is more or less what @dalley and I were looking into doing. It didn't come easy though. I agree with you, we probably could make it work with more time, but the risk/reward is the issue for me.

1) The "not beautiful" solution you mention is another on top of several already "not beautiful" hacks solutions @dalley and I already had to put in place. Which leads me to the idea that...

Hmm, looking at the result, it isn't as bad as expected. LatestVersionField got even simpler.

2) DRF wasn't meant to be used this way. I don't have good evidence for this, but with each workaround we put in place Pulp's code takes more responsibility and DRF takes less. This makes me uncomfortable.

The implementation is based on https://www.django-rest-framework.org/api-guide/relations/#custom-hyperlinked-fields. I think the nested serializers create a lot of complexity (we could even remove them, they seem not to be used in core except for the version href fields).

But I agree to you general feeling. What still makes me uncomfortable is that we have multiple URL patterns pointing to the same view name. Thus, a plain .reverse does not work as expected. But I think that's acceptable.

3) These things will appear magical for plugin writers. For example we would have to use runtime generated names which obfuscates the functionality to plugin writers. I want to Pulp to be very unsurprising for plugin writers. I feel this design would take us further from the principle of least surprise.

In the end, I find:

    repository_version = RepositoryVersionRelatedField(
        required=False
    )

quite unsurprising for the plugin writer. At least less so than the NestedRelatedField monsters.

If the benefit was very large, it would offset these concerns (for me). I don't see the benefit to be that large other than what got us into the discussion with #3541. In fact, I think having repositories spread out across the API makes for a not-as-great user experience which is an additional usability concern I have.

A usability concern I have: Having a bunch of untyped repos for which the user has to look into the respective latest repo version in order to find out what might be in there.

We could implement an endpoint that lists all typed repository hrefs to address your concern (this could probably be added later). And if we do so, a user will know the repo type right away.

After many days of effort from several folks, this path feels like one challenge after the next. If someone showed me fully working PRs I might feel differently.

At least you don't fall victim to the sunk cost fallacy easily... I don't know yet if what I did is fully working. But perhaps it is enough for you to feel differently now(?)

#13 Updated by dalley 26 days ago

  • Status changed from POST to MODIFIED

#14 Updated by daviddavis 24 days ago

  • Status changed from MODIFIED to POST

Temporarily setting back to POST to run https://github.com/pulp/pulp_ansible/pull/250

#15 Updated by dalley 23 days ago

  • Status changed from POST to MODIFIED

Please register to edit this issue

Also available in: Atom PDF