Project

Profile

Help

Story #3844

As a plugin writer, I can use and customize a declarative, concurrent pipeline

Added by bmbouter over 1 year ago. Updated 8 months ago.

Status:
MODIFIED
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

0%

Platform Release:
Blocks Release:
Backwards Incompatible:
No
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:
Sprint 40

Description

Motivation

There are several use cases plugin writers have a hard time fulfilling easily with the current plguin API. There are distinct issues, but could represent a collective opportunity for resolution.

Customization Use Cases

In PR discussion, @gmbnomis brought up an example where he needs to do make additional http calls for content units to be newly created during sync. If he is using a declarative interface he specifically isn't trying to determine on his own which content units need to be created versus those that exist. Making these calls later is both inefficient and could lead to correctness problems if a fatal exception is encountered after being saved with partial information.

The ideal functionality would be akin to adding a "step" in the middle.

Related data Use Cases

We've seen several examples of a Content units, e.g. AnsibleRoleVersion, that have ForeignKey relationships to other non-content unit data, e.g. AnsibleRole. During saving newly created AnsibleRoleVersion data may need to be related to existing AnsibleRole data and the generic core machinery doesn't know how to do that. Also different plugins may want different behaviors.

Validation Use Cases

Plugin writer's may want to prevent saving of new content units or Artifacts if they fail certain validation. For example when adding a new Content Unit, e.g. AnsibleRoleVersion, lint checks could be run on it to ensure it's quality meets the requirements.

Stream based end-to-end Use Case

The plugin writer wants to be able to start processing units (downloading, querying the db, saving, etc) without "all units" being available. This should include the downloading and fetching of initial metadata.

Declarative Use Case

To make plugin writer code as easy as possible, having them declare that state of the remote repository and having the core code do the rest is ideal.

Concurrency Use Case

Each of the stream processing steps should be able to be efficiently run concurrently. Also we want this concurrency to mix well with the concurrency already used by the downloaders (asyncio).

Possible Resolution

Use the producer consumer pattern of asyncio to create a linear pipeline of asyncio stages to create a RepositoryVersion from a stream of unsaved content units and unsaved Artifacts. Plugin writers can inject new, custom stages, reorganize/reuse existing ones, or remove stages to get the stream processing they need.

Overall Design Diagram: https://i.imgur.com/7cEXC5e.png

The design has 3 parts in the pulp/pulp PR.

a) The Stages API itself which is effectively the make_pipeline() method
b) All of the stages that are already compatible with the Stages API. This is most of the code
c) DeclarativeVersion, an object which assembles a specific pipeline that can provide both sync_mode='additive/mirror' and lazy mode support without customization.

Core code is here: https://github.com/pulp/pulp/compare/master...bmbouter:introducing-asyncio-stages

The pulp_file code is here: https://github.com/pulp/pulp_file/compare/master...bmbouter:introducing-asyncio-stages

Todo list

  • Tune the pipeline some. the Queue maxsize=100 may be too small.
  • Add a limiter to the Artifact download stage that restricts the number of Artifact downloads in-flight.
  • Update to use the bulk-create updates from https://pulp.plan.io/issues/3814
  • Update to use the bulk-create updates from https://pulp.plan.io/issues/3813
  • Add some docs
  • use aiofiles to move Artifacts into place just before saving since bulk_create won't call save() for each

Related issues

Blocks File Support - Task #3890: Port pulp_file to use DeclarativeVersion MODIFIED Actions

Associated revisions

Revision 6990563f View on GitHub
Added by bmbouter over 1 year ago

need relative_path at later stage.

just about to refactor

https://pulp.plan.io/issues/3844
re #3844

Revision 6990563f View on GitHub
Added by bmbouter over 1 year ago

need relative_path at later stage.

just about to refactor

https://pulp.plan.io/issues/3844
re #3844

Revision 607e1022 View on GitHub
Added by bmbouter over 1 year ago

Working, but doesn't remove on mirror

https://pulp.plan.io/issues/3844
re #3844

Revision 607e1022 View on GitHub
Added by bmbouter over 1 year ago

Working, but doesn't remove on mirror

https://pulp.plan.io/issues/3844
re #3844

Revision 110945dc View on GitHub
Added by bmbouter over 1 year ago

Adds some docs, committing before code changes

https://pulp.plan.io/issues/3844
re #3844

Revision 110945dc View on GitHub
Added by bmbouter over 1 year ago

Adds some docs, committing before code changes

https://pulp.plan.io/issues/3844
re #3844

Revision 021de19b View on GitHub
Added by bmbouter over 1 year ago

DeclarativeArtiface and DeclarativeContent objects

They were using namedtuples which were not ideal in two ways:

1. They could not be modified so updating them with new results from the
database was problematic.

2. They couldn't provide validation for required versus optional fields.

https://pulp.plan.io/issues/3844
re #3844

Revision 021de19b View on GitHub
Added by bmbouter over 1 year ago

DeclarativeArtiface and DeclarativeContent objects

They were using namedtuples which were not ideal in two ways:

1. They could not be modified so updating them with new results from the
database was problematic.

2. They couldn't provide validation for required versus optional fields.

https://pulp.plan.io/issues/3844
re #3844

Revision 57b8fbae View on GitHub
Added by bmbouter over 1 year ago

Add transactions and progress reporting

https://pulp.plan.io/issues/3844
re #3844

Revision 57b8fbae View on GitHub
Added by bmbouter over 1 year ago

Add transactions and progress reporting

https://pulp.plan.io/issues/3844
re #3844

Revision 51e26e72 View on GitHub
Added by bmbouter over 1 year ago

Replace sleep() with efficient get()

https://pulp.plan.io/issues/3844
re #3844

Revision 51e26e72 View on GitHub
Added by bmbouter over 1 year ago

Replace sleep() with efficient get()

https://pulp.plan.io/issues/3844
re #3844

Revision 5cb7f0dd View on GitHub
Added by bmbouter over 1 year ago

Make content_association a batch operation

With https://pulp.plan.io/issues/3852 the add_content and remove_content
interfaces were switched to using querysets instead of item-by-item
usage. This updates DeclarativeVersion and its stages to work with that
change.

Overall this makes the content_association stage much more efficient.

This PR also adjusts the data type arriving into the content unit
association stage to be DeclarativeVersion.

https://pulp.plan.io/issues/3844
re #3844

Revision 5cb7f0dd View on GitHub
Added by bmbouter over 1 year ago

Make content_association a batch operation

With https://pulp.plan.io/issues/3852 the add_content and remove_content
interfaces were switched to using querysets instead of item-by-item
usage. This updates DeclarativeVersion and its stages to work with that
change.

Overall this makes the content_association stage much more efficient.

This PR also adjusts the data type arriving into the content unit
association stage to be DeclarativeVersion.

https://pulp.plan.io/issues/3844
re #3844

Revision 54c628ce View on GitHub
Added by bmbouter over 1 year ago

Adds max_downloads feature to downloader stage

https://pulp.plan.io/issues/3844
re #3844

Revision 54c628ce View on GitHub
Added by bmbouter over 1 year ago

Adds max_downloads feature to downloader stage

https://pulp.plan.io/issues/3844
re #3844

Revision 4523f5fc View on GitHub
Added by bmbouter over 1 year ago

Introducing FirstStage

@gmbnomis suggested some changes, so this PR implements them.

https://github.com/pulp/pulp_file/commit/c23bec069e12a485000f048e486e1d89cce65beb#r29701942

It has a bug in it, which I need to resolve.
It also doesn't save the files correctly. I am still making Artifact
save compatible with bulk_create()

https://pulp.plan.io/issues/3844
re #3844

Revision 4523f5fc View on GitHub
Added by bmbouter over 1 year ago

Introducing FirstStage

@gmbnomis suggested some changes, so this PR implements them.

https://github.com/pulp/pulp_file/commit/c23bec069e12a485000f048e486e1d89cce65beb#r29701942

It has a bug in it, which I need to resolve.
It also doesn't save the files correctly. I am still making Artifact
save compatible with bulk_create()

https://pulp.plan.io/issues/3844
re #3844

Revision afbaaefc View on GitHub
Added by bmbouter over 1 year ago

files now save in the backend with bulk_create

https://pulp.plan.io/issues/3844
re #3844

Revision afbaaefc View on GitHub
Added by bmbouter over 1 year ago

files now save in the backend with bulk_create

https://pulp.plan.io/issues/3844
re #3844

Revision 642e47f6 View on GitHub
Added by bmbouter over 1 year ago

Ok now it's really working for the second sync too.

https://pulp.plan.io/issues/3844
re #3844

Revision 642e47f6 View on GitHub
Added by bmbouter over 1 year ago

Ok now it's really working for the second sync too.

https://pulp.plan.io/issues/3844
re #3844

Revision c0ceaeb6 View on GitHub
Added by bmbouter over 1 year ago

Fixes a bug caused by using ints not uuids

autoincrement fields cause you to have to get the db-assigned ids from
the database after the first bulk_create(). Otherwise the
RemoteContent.content_artifact attributes will be None.

https://pulp.plan.io/issues/3844
re #3844

Revision c0ceaeb6 View on GitHub
Added by bmbouter over 1 year ago

Fixes a bug caused by using ints not uuids

autoincrement fields cause you to have to get the db-assigned ids from
the database after the first bulk_create(). Otherwise the
RemoteContent.content_artifact attributes will be None.

https://pulp.plan.io/issues/3844
re #3844

Revision 0a5ac52d View on GitHub
Added by bmbouter over 1 year ago

Do away with the closures

This replaces the closures with classes that can hold the required
values. Closure's aren't entirely needed here.

https://pulp.plan.io/issues/3844
re #3844

Revision 0a5ac52d View on GitHub
Added by bmbouter over 1 year ago

Do away with the closures

This replaces the closures with classes that can hold the required
values. Closure's aren't entirely needed here.

https://pulp.plan.io/issues/3844
re #3844

Revision 304dedeb View on GitHub
Added by bmbouter over 1 year ago

Breaking up huge module into a package of modules

All of the user facing plugin-API things are importable from
pulpcore.plugin.stages.

https://pulp.plan.io/issues/3844
re #3844

Revision 304dedeb View on GitHub
Added by bmbouter over 1 year ago

Breaking up huge module into a package of modules

All of the user facing plugin-API things are importable from
pulpcore.plugin.stages.

https://pulp.plan.io/issues/3844
re #3844

Revision 0d16d8a6 View on GitHub
Added by bmbouter over 1 year ago

Some fixes related to file handling

- redo save() to inspect .pks instead
- replace the usage of shutil.move() with django's storage backend
save() method.

https://pulp.plan.io/issues/3844
re #3844

Revision 0d16d8a6 View on GitHub
Added by bmbouter over 1 year ago

Some fixes related to file handling

- redo save() to inspect .pks instead
- replace the usage of shutil.move() with django's storage backend
save() method.

https://pulp.plan.io/issues/3844
re #3844

Revision 082e1e48 View on GitHub
Added by bmbouter over 1 year ago

Updating stages w/ some clearer names

https://pulp.plan.io/issues/3844
re #3844

Revision 082e1e48 View on GitHub
Added by bmbouter over 1 year ago

Updating stages w/ some clearer names

https://pulp.plan.io/issues/3844
re #3844

Revision 34c8536b View on GitHub
Added by bmbouter over 1 year ago

Various fixes from review

Mostly fixes class names for pep8 compliance

https://pulp.plan.io/issues/3844
re #3844

Revision 34c8536b View on GitHub
Added by bmbouter over 1 year ago

Various fixes from review

Mostly fixes class names for pep8 compliance

https://pulp.plan.io/issues/3844
re #3844

Revision bc6d425e View on GitHub
Added by bmbouter over 1 year ago

All stages are classes now

This is a big diff, but mainly due only to indention

https://pulp.plan.io/issues/3844
re #3844

Revision bc6d425e View on GitHub
Added by bmbouter over 1 year ago

All stages are classes now

This is a big diff, but mainly due only to indention

https://pulp.plan.io/issues/3844
re #3844

Revision 579c7004 View on GitHub
Added by bmbouter over 1 year ago

Final round of fixes

  • switch storage backend to DefaultStorage()
  • super() is now called first in anytime it's used
  • FirstStage is removed
  • BaseStage -> Stage
  • the relative path now handles duplicate rel_paths correctly
  • documenting args and kwargs

I did some final hand testing, and I also ensure the docs build and look
good.

https://pulp.plan.io/issues/3844
re #3844

Revision 579c7004 View on GitHub
Added by bmbouter over 1 year ago

Final round of fixes

  • switch storage backend to DefaultStorage()
  • super() is now called first in anytime it's used
  • FirstStage is removed
  • BaseStage -> Stage
  • the relative path now handles duplicate rel_paths correctly
  • documenting args and kwargs

I did some final hand testing, and I also ensure the docs build and look
good.

https://pulp.plan.io/issues/3844
re #3844

Revision 631031e3 View on GitHub
Added by bmbouter over 1 year ago

Fixing a bug where 2 artifacts are used

Without this commit, an exception will occur when one content unit has
two or more Artifacts are associated with it.

https://pulp.plan.io/issues/3844
re #3844

Revision 631031e3 View on GitHub
Added by bmbouter over 1 year ago

Fixing a bug where 2 artifacts are used

Without this commit, an exception will occur when one content unit has
two or more Artifacts are associated with it.

https://pulp.plan.io/issues/3844
re #3844

Revision ca7925b3 View on GitHub
Added by bmbouter over 1 year ago

Have DeclarativeVersion validate download digests

The downloaders created by DeclarativeVersion didn't specify any digest
or size information to use for validation even if it was available. I've
tested this code and it worked for me.

https://pulp.plan.io/issues/3844
re #3844

Revision ca7925b3 View on GitHub
Added by bmbouter over 1 year ago

Have DeclarativeVersion validate download digests

The downloaders created by DeclarativeVersion didn't specify any digest
or size information to use for validation even if it was available. I've
tested this code and it worked for me.

https://pulp.plan.io/issues/3844
re #3844

History

#1 Updated by bmbouter over 1 year ago

  • Description updated (diff)

Adding some todos

#2 Updated by bmbouter over 1 year ago

  • Sprint Candidate changed from No to Yes

#3 Updated by bmbouter over 1 year ago

  • Description updated (diff)

#4 Updated by bmbouter over 1 year ago

  • Description updated (diff)

#5 Updated by jortel@redhat.com over 1 year ago

  • Groomed changed from No to Yes

#6 Updated by dkliban@redhat.com over 1 year ago

  • Sprint set to Sprint 40

#7 Updated by bmbouter over 1 year ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter

#9 Updated by bmbouter over 1 year ago

  • Blocks Task #3890: Port pulp_file to use DeclarativeVersion added

#10 Updated by bmbouter over 1 year ago

  • Status changed from POST to MODIFIED

This is merged.

#11 Updated by daviddavis 8 months ago

  • Sprint/Milestone set to 3.0

#12 Updated by bmbouter 8 months ago

  • Tags deleted (Pulp 3)

Please register to edit this issue

Also available in: Atom PDF