Project

Profile

Help

Story #7469

closed

as a Pulp File user, I can have my repository auto published and distributed

Added by dkliban@redhat.com over 3 years ago. Updated about 3 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Sprint/Milestone:
Start date:
Due date:
% Done:

100%

Estimated time:
(Total: 0:00 h)
Platform Release:
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Sprint 94
Quarter:
Q1-2021

Description

As a CLI user or a Web UI user, I would like to be able to sync/publish/distribute a repository with a single operation.

In order to support such a feature, the user needs to be able to associate a File Distribution with a File Repository. This would be a new attribute of a File Repository.

The repository sync API needs to accept an additional parameter called 'auto_distribute'. When such a parameter is supplied, the sync task needs to create a new repository version, a new publication, and update the File DIstribution with the latest repository version.


Sub-issues 1 (0 open1 closed)

Story #8140: As a user, I can distinguish which "action" progress report(s) come from within a task that performs multiple actionsCLOSED - WONTFIXdalley

Actions

Related issues

Related to RPM Support - Story #7622: as a Pulp Rpm user, I can have my repository synced, auto published and distributedCLOSED - CURRENTRELEASEdalley

Actions
Related to Pulp - Story #7626: When modifying a repository content, I want to have my repository auto published and distributedCLOSED - CURRENTRELEASEdalley

Actions
Related to File Support - Task #8548: Document auto-publish / distributeCLOSED - CURRENTRELEASEdalley

Actions
Blocked by Pulp - Story #7815: As a plugin writer, pulpcore ensures that a job working directory is set/removed properlyCLOSED - CURRENTRELEASEdalley

Actions
Actions #1

Updated by dkliban@redhat.com over 3 years ago

  • Project changed from Pulp to File Support
Actions #2

Updated by bmbouter over 3 years ago

One question I have is about whether the sync and publish will occur as one task or two. From a user, usability perspective having a single task to monitor I think would be easier. Functionally speaking both tasks require a lock on the repository right? If so, then having them run as one task is equally efficient to having them run as two.

Actions #3

Updated by dkliban@redhat.com over 3 years ago

This will run in a single task. The user will monitor a single task. You are correct that both sync and publish lock on a repository.

Actions #4

Updated by daviddavis over 3 years ago

Yea, I think it has to be a single task for the Web UI. Otherwise, if I kick off a sync-and-publish task and then close my browser, then the publish never happens.

Actions #5

Updated by daviddavis over 3 years ago

  • Quarter set to Q1-2021
Actions #6

Updated by daviddavis over 3 years ago

Updated quarter based on last 3-month planning meeting.

Actions #7

Updated by dkliban@redhat.com over 3 years ago

Users would like to be able to specify at sync time whether the spawned task should only sync, sync and publish, or sync, publish, and distribute.

Actions #9

Updated by daviddavis over 3 years ago

  • Related to Story #7622: as a Pulp Rpm user, I can have my repository synced, auto published and distributed added
Actions #10

Updated by dalley over 3 years ago

@dkliban

So let's say at sync time I want to publish and distribute, also. How does it decide what publication settings to use? Do we need to accept those at sync time also?

What about distribution base path?

Actions #11

Updated by dkliban@redhat.com over 3 years ago

dalley wrote:

@dkliban

So let's say at sync time I want to publish and distribute, also. How does it decide what publication settings to use? Do we need to accept those at sync time also?

We have two options for this:

  1. Add 'publication_options' to the FileRepository model so the user only provides them once. This could be multiple fields or json object.
  2. Require the user to submit publication options with each sync request.

The first option seems more user friendly. What do you think?

What about distribution base path?

The user would create a distribution first and then create a repository with a reference to that distribution. So the base_path would always be the one assigned to the distribution.

Actions #12

Updated by dalley over 3 years ago

The first option seems more user friendly. What do you think?

I think we're reimplementing Pulp 2 piece by piece lol. Not that it's the wrong strategy to take. I agree that it's probably simpler, it just leaves me questioning whether we should have done a few things differently from the beginning.

The user would create a distribution first and then create a repository with a reference to that distribution. So the base_path would always be the one assigned to the distribution.

This is a new relationship being created, right? I don't see it in the existing models.

Actions #13

Updated by daviddavis over 3 years ago

I think we're reimplementing Pulp 2 piece by piece lol. Not that it's the wrong strategy to take. I agree that it's probably simpler, it just leaves me questioning whether we should have done a few things differently from the beginning.

TBH, I was not a fan of when we stopped storing sync/publish options in the database. It requires users to somehow remember and recall them every time they sync/publish.

This is a new relationship being created, right? I don't see it in the existing models.

That's correct. The question is whether it belongs on the Repository model or on the FileRepository (ie in pulpcore or pulp_file).

ETA: Actually it needs to go on FileRepository since we have two different types of distributions.

Actions #14

Updated by dalley over 3 years ago

Does that mean that we need to implement it separately on the repository models for every plugin? : /

Maybe this is totally crazy but why not on the distribution? In other words, turn the simple case into a "pull model" where the user just "refreshes" the particular distribution and everything else happens in the background?

Or else create a way to describe a repository, remote, publish setting, distribution combination and make anything that is part of such a thing auto-publish & distribute.

It seems cleaner than adding all these relationships between components that are supposed to be separate. We already have these relationships:

  • distribution -> publication
  • distribution -> repository version
  • publication -> repository version

If we add

  • repository version -> distribution

then we're going to end up with a potential reference cycle (distribution -> publication -> repository version -> distribution) and that code-smell is terrible.

edit: I guess you meant repository -> distribution instead of repository version -> distribution?

Actions #15

Updated by dalley over 3 years ago

Let's say there was a separate piece of database metadata we could create that linked together:

  • A repository
  • A remote + sync settings (optional)
  • publish settings
  • A distribution

If a repository was part of one of these (let's say it can only be part of one) then new repository versions are automatically published & distributed. We could also add a "refresh" endpoint that only works when all these things are true, which would avoid needing to keep track of the remote all the time.

And then we don't need to keep so many references between the pieces themselves.

Actions #16

Updated by dalley over 3 years ago

How does RHUI do this?

Actions #17

Updated by daviddavis over 3 years ago

dalley wrote:

Does that mean that we need to implement it separately on the repository models for every plugin? : /

Maybe this is totally crazy but why not on the distribution? In other words, turn the simple case into a "pull model" where the user just "refreshes" the particular distribution and everything else happens in the background?

Or else create a way to describe a repository, remote, publish setting, distribution combination and make anything that is part of such a thing auto-publish & distribute.

It seems cleaner than adding all these relationships between components that are supposed to be separate. We already have these relationships:

  • distribution -> publication
  • distribution -> repository version
  • publication -> repository version

If we add

  • repository version -> distribution

then we're going to end up with a potential reference cycle (distribution -> publication -> repository version -> distribution) and that code-smell is terrible.

edit: I guess you meant repository -> distribution instead of repository version -> distribution?

Yea, we'd need to implement it separately for every plugin. And you're correct: the repository would point to the distribution. I'm not opposed to putting the relationship on distribution. In fact, that way a repository sync could auto-distribute to more than one distribution on sync.

For this idea of "refreshing" distributions, what would the API look like? Would users still call repo sync or would they be calling some distribution endpoint?

dalley wrote:

Let's say there was a separate piece of database metadata we could create that linked together:

  • A repository
  • A remote + sync settings (optional)
  • publish settings
  • A distribution

If a repository was part of one of these (let's say it can only be part of one) then new repository versions are automatically published & distributed. We could also add a "refresh" endpoint that only works when all these things are true, which would avoid needing to keep track of the remote all the time.

And then we don't need to keep so many references between the pieces themselves.

I agree that having these references between objects is not ideal. The one obstacle I see is that the connection between repo and remote already exists and I'm not sure how we can change that in a backwards compatible way. Would another option be to store this information on the repository? Perhaps this might be a solution until Pulp 4 when we can implement this new object.

dalley wrote:

How does RHUI do this?

This is a good question and perhaps worth checking with RHUI on.

Actions #18

Updated by daviddavis over 3 years ago

  • Related to Story #7626: When modifying a repository content, I want to have my repository auto published and distributed added
Actions #19

Updated by dkliban@redhat.com over 3 years ago

Based on the discussion above, the implementation looks like this:

The FileRepository model will have four additional fields: manifest, mirror, auto_publish, auto_distribute.

The File Repository Sync API will accept two additional parameters: auto_publish and auto_distribute. The auto_distribute field will only be able to be True if auto_publish is also True.

The File Distribution model will have one additional field: repository.

The sync task will be updated to consider the new FileRepository fields when performing a sync. If auto_publish is enabled on the repository or in the one time sync options, the sync task will create a publication. If the auto_distribute is enabled, the sync task will update all the File Distributions that reference this repository with the publication that was created. This API will verify that there is at least one File Distribution to update if auto_distribute is set to True. If there is no File Distribution to update, the user will be notified and no task will be started. The task will lock on the File Repository and all the File Distributions that need to be updated.

The workflow for the user that wants to use auto distribution will be the following:

 - Create a repository and set auto_distribute to True.  - Create a distribution and set the repository to the pulp_href from the previous step.  - Sync the repository.

The workflow for the user that wants to auto publish will be the following:

 - Create a repository and set auto_publish to True.  - Sync the repository.

The workflow for a user that wants to only create a new repository version for a repository that has auto_distribute enabled:

 - Create a repository and set auto_distribute to True.  - Create a distribution and set the repository to the pulp_href from the previous step.  - Sync the repository and specify auto_publish as False and auto_distribute as False.

Actions #20

Updated by daviddavis over 3 years ago

That sounds good to me. A couple thoughts:

  • I think mirror belongs on the remote since it's specific to syncing
  • I had imagined that auto_publish and auto_distribute would be specified each time calling sync but I have no objections to putting it on the model
  • I think manifest is a publish-specific option and should maybe not live on the Repository model. I'm imagining that plugins might have lots of publish options that they might want to store and share across repos so we might want to consider a separate object for this.
Actions #21

Updated by dalley over 3 years ago

I think mirror belongs on the remote since it's specific to syncing

It doesn't have to be specific to syncing. We're about to add mirrored metadata, so "mirror" can imply "this repository cannot be manually edited, and mirror sync is implied by the presence". We need a way to toggle the metadata mirroring behavior at publish-time anyways (granted, not for the file plugin). So it would serve triple-duty and resolve all of those issues.

This would be a superset of the behavior of the sync "mirror" parameter, but I think it matches the use case well. If a user wants to set "mirror" on their repository so that every sync would be a mirror sync, wanting to keep a forever-exact clone is probably what they want regardless, so the semantics difference isn't necessarily a bad thing. The only place it gets dicey is that "mirror" on the repository doesn't make a tooooon of sense unless the remote being mirrored is part of the repository also...

This is specific to the "mirror" parameter, just to be clear, I don't think we should be putting put 100% sync-specific settings on the repository, same with publish settings.

I think manifest is a publish-specific option and should maybe not live on the Repository model. I'm imagining that plugins might have lots of publish options that they might want to store and share across repos so we might want to consider a separate object for this.

I do agree with this. I'd also add that if we kept the publish settings in a separate object, we could eventually transition to making them copy-on-write, and each Publication could therefore know both the repository version and publish settings it was created from. That would really help auditability, and also potentially be leveraged to avoid re-doing work. We could skip re-publishing if a publication with identical inputs already exists.

At the very least, I think it should be in a nested serializer, so that they are kept separate from the user perspective, even if we would store the data on the repository for now.


I do have some other minor problems with this proposal though.

  • I actually think we should have Repository reference BaseDistribution rather than the other way around. I know I suggested this earlier but it only really makes sense if we're going to simultanously support push and pull workflows and I don't really think we should attempt that anymore.
  • I think we should reduce the immediate scope of our implementation down to the strict MVP we need - which is, auto publish + distribute together. We can make it more configurable later (although I don't mind creating a solid plan on how to do that in the future).
    • So if we do that, the simple existence of the (nullable) FK from repository -> base distribution, implies auto-distribute, implies auto-publish.
      • The toggles we add later can turn it back off.
    • UX improvement: We can auto-create Distributions directly via the Repository viewsets. When you're setting up the repository, give it the base path -- if a distribution with an overlapping base path exists already, fail, otherwise, create it automatically.
      • If we follow that same pattern with sync & publish settings then we're back to being able to set Pulp up to mirror a repository in a single request.

In workflow terms, what I would like to see is:

Create a repository, providing the distribution + publish settings. If distribution is provided but publish settings are not, then it fails to validate (but registry-oriented publication-less plugins should work because they have no need for publish settings).

If possible it would be great if you could either specify the actual publish settings & distribution base path in a nested fashion in the request, OR the href of an existing one. I think DRF could handle that. And we should use get-or-create to do copy-on-write so that we don't end up with duplicate settings objects.


In terms of code changes:

Repository:
   distributes_to (BaseDistribution)
   publish_settings (BasePublishSettings)
   mirror (bool)

FileRepository(Repository)
    < no change>
   
BasePublishSettings(MasterModel):
   < empty >

FilePublishSettings(BasePublishSettings):
   manifest (str)

auto_publish & auto_distribute could then be added later (if still desired) to restrict how much is done by default.

Also I'm not against skipping the model inheritance and just using a json field for the publish settings, it would make everything easier. Maybe we'd out-grow it eventually but at that point in time we could do a migration.

Actions #22

Updated by daviddavis over 3 years ago

Your proposal sounds mostly good to me.

A couple thoughts: moving the distribution relationship to the repository means that users can't auto-distribute to multiple distributions. I don't have an objection but it's worth calling out.

I'm still dubious about putting mirror on repository and actually I think it might be better to wait to do so when we address #6353. I don't think storing mirror is necessary for this feature. I'm happy to talk more about my concerns but basically it seems that there are two forms of mirroring: mirror syncing where only content is mirrored from the upstream repos during sync time and mirrored repos which can't be manually edited, mirror the upstream published data, etc. I think it's worth discussing how we differentiate the two and how they work together (or don't).

Actions #23

Updated by dalley over 3 years ago

A couple thoughts: moving the distribution relationship to the repository means that users can't distribute to multiple repos. I don't have an objection but it's worth calling out.

Fair. I didn't really think about that use case - admittedly, I'm not certain how common it would be. If we have a solid reason to support that use case then I'm fine with doing so.

I don't think storing mirror is necessary for this feature.

Also probably true, although we'll be implementing mirroring within the next few weeks anyway. No objection to holding off on it.

Actions #24

Updated by dkliban@redhat.com over 3 years ago

+1 to keeping things simple. If we are going to have users provide a base_path when creating a repository, then it makes sense to have the foreign key to a distribution on a repository and not the other way around. THe publish options should be their own object. However, the options should also be provided when repository is created or updated.

Actions #25

Updated by dalley over 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley
Actions #26

Updated by dalley over 3 years ago

Do we have any opinion on whether a simple JSONField to store the settings is good enough for now, or if we should use master model?

Actions #27

Updated by dalley over 3 years ago

Also, in terms of how the work is accomplished, should it all happen within one task? Multiple tasks? Multiple tasks associated with a task group?

Metadata-mirror sync will be multiple stages within one task, so moving in that direction might be appropriate.

Actions #28

Updated by dkliban@redhat.com over 3 years ago

This should be a single task.

Actions #29

Updated by dalley over 3 years ago

If it's one task, how do we want progress reporting to work?

  • We could just mix all of the reports together, but that seems pretty messy.
  • We could add a "stage" field and group reports by it, but that breaks the API.
  • Compromise, we could add a field named "stage" or "action" and sort (not group) by it, and let API consumers group on their own. Probably the best option in the short-term.
Actions #30

Updated by daviddavis over 3 years ago

Compromise, we could add a field named "stage" or "action" and sort (not group) by it, and let API consumers group on their own. Probably the best option in the short-term.

This seems like a good option. I'd maybe open a pulpcore issue and get some feedback from a couple more pulpcore devs.

Actions #32

Updated by bmbouter about 3 years ago

dalley and I met and here's a summary of what we talked about:

Create an AutoDistribute for Publications

I believe we should have PublicationDistribution get a new field like the one below:

repository = models.ForeignKey(Repository, null=True, on_delete=models.SET_NULL). This would be the same as is on RespotiroyVersionDistribution today.

Why wouldn't we add this new field to BaseDistribution instead? Because then it would be moving the repository field from Distributions that subclass RepositoryVersionDistribution from the detail table to the master table and we don't have enough time to do that right now. We can do that later.

We could instead have this sync+publish call flow also update the PublicationDistribution to point to the newly created subclass, but we should not for 2 reasons:

  1. It will cause an additional lock to be needed because anytime a distribution is updated it needs to be done in a locked way. Having the PublicationDistribution just know to distribute the latest publication of a repository requires no additional update or lock.

  2. It's a usable feature on its own, which doesn't bury auto-publication-distribution in a full sync->publish->distribute workflow.

I remember us talking about this idea in the past when we added the repository field to RepositoryVersionDistribution, but we determined we didn't need to at that time. I can't find a record of that, but I did find a link to a Debian issue asking for almost the same thing: https://pulp.plan.io/issues/5381

Have a Repository.post_save call provide the call to the plugins publish task as a synchronous call

Right now RespositoryVersion finalization hook provides plugin writers with a callback opportunity for them to define the callback on their Repository detail. Let's add another callback called post_save which could be called here.

Then on the Repository detail definition, e.g. this one in pulp_file they could define FileRepository.post_save. This would allow the plugin writer to call the publish task syncrhonously and with the options saved in the Repository.publish_options in the design already planned.

This would be good for a variety of reasons:

  • It provides auto-publication-creation anytime a RepositoryVersion is made, so this would happen automatically for both modify in addition to sync, etc.
  • It causes the publication to occur in the same transaction provided by the RepositoryVersion creation so it's atomic.
  • It doesn't need any extra locks so it's safe to do right there.
Actions #33

Updated by dalley about 3 years ago

  • Blocked by Story #7815: As a plugin writer, pulpcore ensures that a job working directory is set/removed properly added
Actions #34

Updated by pulpbot about 3 years ago

  • Status changed from ASSIGNED to POST
Actions #35

Updated by dalley about 3 years ago

  • Sprint set to Sprint 90
Actions #36

Updated by dalley about 3 years ago

So to recap the changes I made that we discussed earlier:

  • Created a "PublishSettings" model and associated it with the repository. Created a new endpoint /publish_settings/$plugin/$type where they can be created.
    • Each plugin subclasses the model, serializer, viewset to add their publish options (but only iif they use publications, container plugin e.g. doesn't need to).
  • Add a hook to the repository model, named "on_new_version", that functions similarly to "finalize_new_version". It's called just after new repository versions are created, inside the context manager (which means inside the same task).

Things I did that we didn't explicitly discuss earlier but ended up being necessary / good ideas:

  • sync tasks are now expected to return the repository version they created with DeclarativeVersion.create() and publish tasks are expected to return the publication. So this is a plugin API change. It's necessary to know whether or not to trigger the auto-actions, because if a sync doesn't end up creating a new repository version, then publish shouldn't happen, and in order to distribute the new publication we need to get it back from the publish.
  • Change the publish tasks so that they take the repository version PK and the publish settings PK and source all of their settings from the publish settings object. This makes it more consistent. Publishing without a publish settings object will get_or_create() a matching one.
  • Put a foreign key to PublishSettings on the Publication model.

Things I didn't do that we had discussed:

  • Add individual toggles for auto_publish and auto_distribute. They can be added later but the general feature doesn't require them. It's still something you need to somewhat opt into.
  • Make the PublishSettings and Distribution serializers nested inside the Repository serializer. It would be inconsistent with what we do for remotes, and also it's a major pain to do with DRF. Reading nested serializers is not so bad, but writing nested serializers is a lot more painful. And unlike HREFs we have no utilities to do this yet, and master-detail makes it nontrivial.
  • Distributions can't be listed from the repository endpoint at all, because "repository" is on the abstract Distribution classes. You can't work backwards from Repository -> BaseDistribution, you'd have to use explicit types.
Actions #37

Updated by daviddavis about 3 years ago

All of these changes sound good.

The downside: Migrating this properly requires that every plugin would need set a hard floor of 3.11 in the next release, and old plugins would also be incompatible with new Pulpcore.

So like the existing plugins that we've already released would break once 3.11 drops?

Actions #38

Updated by dalley about 3 years ago

So like the existing plugins that we've already released would break once 3.11 drops?

Yes, if they update pulpcore without updating plugins. But this is already the case for some plugins due to the FIPS changes, and maybe true regardless because of the "normal" auto-publish changes (at least without very gross hacks).

e.g. New pulpcore now assumes that you can set the "repository" on publication-based distributions, but since it's an abstract class there's no way to enforce that. The plugin might not be updated therefore you can't assume it's there. So then you have to hack the serializers to exclude that field if the plugin is too old. And that's a separate issue from this proposed extra change.

Using abstract models across app boundaries seems like a very very bad idea in hindsight.

Actions #39

Updated by bmbouter about 3 years ago

The deprecation policy guarantees that GA of current plugins (released with 3.10) can declare compatibility with 3.11. Whatever we do we need to ensure that stays true. I think that means we need a 1-cycle deprecation of some kind with this change. What can be done to have that be the case?

Actions #40

Updated by dalley about 3 years ago

So I don't want to waste more time if this isn't going to work out, I'm not going to push for it endlessly or anything. It's just a way to kill multiple birds with one stone, although a very heavy one. And I figured that if we have multiple other reasons to sync up the versions then it was worth suggesting.

But to answer your question, I don't think it's possible, practically or perhaps literally. Moving the data across apps is a 4 step process.

  • Create a new field on the core models (core)
  • Move the data from the plugin models to the new the new field on core (plugin)
  • Then the old field needs to be deleted on the plugin model (plugin)
  • Then the core field needs to be renamed to the original name (core)

The problem is you fundamentally can't separate step 3 from step 4, and one is in pulpcore and the other in plugin(s).

You have to say "this plugin migration needs to happen before this pulpcore migration but after this other pulpcore migration" which you can only do if the 2nd core migration hasn't been applied yet. So migrations 3 and 4 need to run at the same time, which means core and all the plugins would need to be upgraded at the same time.

Actions #41

Updated by dalley about 3 years ago

So the options are: don't do it, hack the distribution serializer to deal with the abstract class uncertainty problems, listed improvements aren't possible until Pulp 4 when we can enforce this without breaking policy.

Do it: no hacks, multiple issues improved at once, violate deprecation policy : /

I understand if it's impossible.

Actions #42

Updated by rchan about 3 years ago

  • Sprint changed from Sprint 90 to Sprint 91
Actions #43

Updated by rchan about 3 years ago

  • Sprint changed from Sprint 91 to Sprint 92
Actions #44

Updated by rchan about 3 years ago

  • Sprint changed from Sprint 92 to Sprint 93
Actions #46

Updated by rchan about 3 years ago

  • Sprint changed from Sprint 93 to Sprint 94

Added by dalley about 3 years ago

Revision 0e56b0f4 | View on GitHub

Add support for automatic publishing and distributing

closes: #7469 https://pulp.plan.io/issues/7469

Actions #47

Updated by dalley about 3 years ago

  • Status changed from POST to MODIFIED
Actions #48

Updated by daviddavis about 3 years ago

  • Sprint/Milestone set to 1.7.0
Actions #49

Updated by dalley about 3 years ago

  • Related to Task #8548: Document auto-publish / distribute added
Actions #50

Updated by daviddavis about 3 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Also available in: Atom PDF