Story #3844
closedAs a plugin writer, I can use and customize a declarative, concurrent pipeline
0%
Description
Motivation¶
There are several use cases plugin writers have a hard time fulfilling easily with the current plguin API. There are distinct issues, but could represent a collective opportunity for resolution.
Customization Use Cases¶
In PR discussion, gmbnomis brought up an example where he needs to do make additional http calls for content units to be newly created during sync. If he is using a declarative interface he specifically isn't trying to determine on his own which content units need to be created versus those that exist. Making these calls later is both inefficient and could lead to correctness problems if a fatal exception is encountered after being saved with partial information.
The ideal functionality would be akin to adding a "step" in the middle.
Related data Use Cases¶
We've seen several examples of a Content units, e.g. AnsibleRoleVersion, that have ForeignKey relationships to other non-content unit data, e.g. AnsibleRole. During saving newly created AnsibleRoleVersion data may need to be related to existing AnsibleRole data and the generic core machinery doesn't know how to do that. Also different plugins may want different behaviors.
Validation Use Cases¶
Plugin writer's may want to prevent saving of new content units or Artifacts if they fail certain validation. For example when adding a new Content Unit, e.g. AnsibleRoleVersion, lint checks could be run on it to ensure it's quality meets the requirements.
Stream based end-to-end Use Case¶
The plugin writer wants to be able to start processing units (downloading, querying the db, saving, etc) without "all units" being available. This should include the downloading and fetching of initial metadata.
Declarative Use Case¶
To make plugin writer code as easy as possible, having them declare that state of the remote repository and having the core code do the rest is ideal.
Concurrency Use Case¶
Each of the stream processing steps should be able to be efficiently run concurrently. Also we want this concurrency to mix well with the concurrency already used by the downloaders (asyncio).
Possible Resolution¶
Use the producer consumer pattern of asyncio to create a linear pipeline of asyncio stages to create a RepositoryVersion from a stream of unsaved content units and unsaved Artifacts. Plugin writers can inject new, custom stages, reorganize/reuse existing ones, or remove stages to get the stream processing they need.
Overall Design Diagram: https://i.imgur.com/7cEXC5e.png
The design has 3 parts in the pulp/pulp PR.
a) The Stages API itself which is effectively the make_pipeline()
method
b) All of the stages that are already compatible with the Stages API. This is most of the code
c) DeclarativeVersion, an object which assembles a specific pipeline that can provide both sync_mode='additive/mirror' and lazy mode support without customization.
Core code is here: https://github.com/pulp/pulp/compare/master...bmbouter:introducing-asyncio-stages
The pulp_file code is here: https://github.com/pulp/pulp_file/compare/master...bmbouter:introducing-asyncio-stages
Todo list¶
- Tune the pipeline some. the Queue maxsize=100 may be too small.
- Add a limiter to the Artifact download stage that restricts the number of Artifact downloads in-flight.
- Update to use the bulk-create updates from https://pulp.plan.io/issues/3814
- Update to use the bulk-create updates from https://pulp.plan.io/issues/3813
- Add some docs
- use aiofiles to move Artifacts into place just before saving since bulk_create won't call save() for each
Related issues
Updated by bmbouter over 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to bmbouter
Updated by bmbouter over 6 years ago
- Status changed from ASSIGNED to POST
Updated by bmbouter over 6 years ago
- Blocks Task #3890: Port pulp_file to use DeclarativeVersion added
Added by bmbouter over 6 years ago
Added by bmbouter over 6 years ago
Added by bmbouter over 6 years ago
Added by bmbouter over 6 years ago
Revision 607e1022 | View on GitHub
Working, but doesn't remove on mirror
Added by bmbouter over 6 years ago
Revision 607e1022 | View on GitHub
Working, but doesn't remove on mirror
Added by bmbouter over 6 years ago
Revision fb03774a | View on GitHub
Moving docs fixes to its own PR
Added by bmbouter over 6 years ago
Revision fb03774a | View on GitHub
Moving docs fixes to its own PR
Added by bmbouter over 6 years ago
Revision 03473b5c | View on GitHub
sync_mode works and new docstrings
Added by bmbouter over 6 years ago
Revision 03473b5c | View on GitHub
sync_mode works and new docstrings
Added by bmbouter over 6 years ago
Revision 110945dc | View on GitHub
Adds some docs, committing before code changes
Added by bmbouter over 6 years ago
Revision 110945dc | View on GitHub
Adds some docs, committing before code changes
Added by bmbouter over 6 years ago
Revision 6b0f0c62 | View on GitHub
switched last stage to use QuerSet
Added by bmbouter over 6 years ago
Revision 6b0f0c62 | View on GitHub
switched last stage to use QuerSet
Added by bmbouter over 6 years ago
Revision 021de19b | View on GitHub
DeclarativeArtiface and DeclarativeContent objects
They were using namedtuples which were not ideal in two ways:
-
They could not be modified so updating them with new results from the database was problematic.
-
They couldn't provide validation for required versus optional fields.
Added by bmbouter over 6 years ago
Revision 021de19b | View on GitHub
DeclarativeArtiface and DeclarativeContent objects
They were using namedtuples which were not ideal in two ways:
-
They could not be modified so updating them with new results from the database was problematic.
-
They couldn't provide validation for required versus optional fields.
Added by bmbouter over 6 years ago
Revision 57b8fbae | View on GitHub
Add transactions and progress reporting
Added by bmbouter over 6 years ago
Revision 57b8fbae | View on GitHub
Add transactions and progress reporting
Added by bmbouter over 6 years ago
Revision 51e26e72 | View on GitHub
Replace sleep() with efficient get()
Added by bmbouter over 6 years ago
Revision 51e26e72 | View on GitHub
Replace sleep() with efficient get()
Added by bmbouter over 6 years ago
Revision dc2dad8a | View on GitHub
docstring the progress reporting
Added by bmbouter over 6 years ago
Revision dc2dad8a | View on GitHub
docstring the progress reporting
Added by bmbouter over 6 years ago
Revision 5cb7f0dd | View on GitHub
Make content_association a batch operation
With https://pulp.plan.io/issues/3852 the add_content and remove_content interfaces were switched to using querysets instead of item-by-item usage. This updates DeclarativeVersion and its stages to work with that change.
Overall this makes the content_association stage much more efficient.
This PR also adjusts the data type arriving into the content unit association stage to be DeclarativeVersion.
Added by bmbouter over 6 years ago
Revision 5cb7f0dd | View on GitHub
Make content_association a batch operation
With https://pulp.plan.io/issues/3852 the add_content and remove_content interfaces were switched to using querysets instead of item-by-item usage. This updates DeclarativeVersion and its stages to work with that change.
Overall this makes the content_association stage much more efficient.
This PR also adjusts the data type arriving into the content unit association stage to be DeclarativeVersion.
Added by bmbouter over 6 years ago
Revision 54c628ce | View on GitHub
Adds max_downloads feature to downloader stage
Added by bmbouter over 6 years ago
Revision 54c628ce | View on GitHub
Adds max_downloads feature to downloader stage
Added by bmbouter over 6 years ago
Revision 4523f5fc | View on GitHub
Introducing FirstStage
gmbnomis suggested some changes, so this PR implements them.
https://github.com/pulp/pulp_file/commit/c23bec069e12a485000f048e486e1d89cce65beb#r29701942
It has a bug in it, which I need to resolve. It also doesn't save the files correctly. I am still making Artifact save compatible with bulk_create()
Added by bmbouter over 6 years ago
Revision 4523f5fc | View on GitHub
Introducing FirstStage
gmbnomis suggested some changes, so this PR implements them.
https://github.com/pulp/pulp_file/commit/c23bec069e12a485000f048e486e1d89cce65beb#r29701942
It has a bug in it, which I need to resolve. It also doesn't save the files correctly. I am still making Artifact save compatible with bulk_create()
Added by bmbouter over 6 years ago
Revision afbaaefc | View on GitHub
files now save in the backend with bulk_create
Added by bmbouter over 6 years ago
Revision afbaaefc | View on GitHub
files now save in the backend with bulk_create
Added by bmbouter over 6 years ago
Revision 642e47f6 | View on GitHub
Ok now it's really working for the second sync too.
Added by bmbouter over 6 years ago
Revision 642e47f6 | View on GitHub
Ok now it's really working for the second sync too.
Added by bmbouter over 6 years ago
Revision 51330150 | View on GitHub
Removing accidental docs change
Added by bmbouter over 6 years ago
Revision 51330150 | View on GitHub
Removing accidental docs change
Added by bmbouter over 6 years ago
Revision c0ceaeb6 | View on GitHub
Fixes a bug caused by using ints not uuids
autoincrement fields cause you to have to get the db-assigned ids from the database after the first bulk_create(). Otherwise the RemoteContent.content_artifact attributes will be None.
Added by bmbouter over 6 years ago
Revision c0ceaeb6 | View on GitHub
Fixes a bug caused by using ints not uuids
autoincrement fields cause you to have to get the db-assigned ids from the database after the first bulk_create(). Otherwise the RemoteContent.content_artifact attributes will be None.
Added by bmbouter over 6 years ago
Revision 0a5ac52d | View on GitHub
Do away with the closures
This replaces the closures with classes that can hold the required values. Closure's aren't entirely needed here.
Added by bmbouter over 6 years ago
Revision 0a5ac52d | View on GitHub
Do away with the closures
This replaces the closures with classes that can hold the required values. Closure's aren't entirely needed here.
Added by bmbouter over 6 years ago
Revision 304dedeb | View on GitHub
Breaking up huge module into a package of modules
All of the user facing plugin-API things are importable from pulpcore.plugin.stages.
Added by bmbouter over 6 years ago
Revision 304dedeb | View on GitHub
Breaking up huge module into a package of modules
All of the user facing plugin-API things are importable from pulpcore.plugin.stages.
Added by bmbouter over 6 years ago
Revision 0d16d8a6 | View on GitHub
Some fixes related to file handling
- redo save() to inspect .pks instead
- replace the usage of shutil.move() with django's storage backend save() method.
Added by bmbouter over 6 years ago
Revision 0d16d8a6 | View on GitHub
Some fixes related to file handling
- redo save() to inspect .pks instead
- replace the usage of shutil.move() with django's storage backend save() method.
Added by bmbouter over 6 years ago
Revision 55553cf3 | View on GitHub
Adds docs for the stages package
Added by bmbouter over 6 years ago
Revision 55553cf3 | View on GitHub
Adds docs for the stages package
Added by bmbouter over 6 years ago
Revision 082e1e48 | View on GitHub
Updating stages w/ some clearer names
Added by bmbouter over 6 years ago
Revision 082e1e48 | View on GitHub
Updating stages w/ some clearer names
Added by bmbouter over 6 years ago
Revision 34c8536b | View on GitHub
Various fixes from review
Mostly fixes class names for pep8 compliance
Added by bmbouter over 6 years ago
Revision 34c8536b | View on GitHub
Various fixes from review
Mostly fixes class names for pep8 compliance
Added by bmbouter over 6 years ago
Revision bc6d425e | View on GitHub
All stages are classes now
This is a big diff, but mainly due only to indention
Added by bmbouter over 6 years ago
Revision bc6d425e | View on GitHub
All stages are classes now
This is a big diff, but mainly due only to indention
Added by bmbouter over 6 years ago
Revision 579c7004 | View on GitHub
Final round of fixes
- switch storage backend to DefaultStorage()
- super() is now called first in anytime it's used
- FirstStage is removed
- BaseStage -> Stage
- the relative path now handles duplicate rel_paths correctly
- documenting args and kwargs
I did some final hand testing, and I also ensure the docs build and look good.
Added by bmbouter over 6 years ago
Revision 579c7004 | View on GitHub
Final round of fixes
- switch storage backend to DefaultStorage()
- super() is now called first in anytime it's used
- FirstStage is removed
- BaseStage -> Stage
- the relative path now handles duplicate rel_paths correctly
- documenting args and kwargs
I did some final hand testing, and I also ensure the docs build and look good.
Added by bmbouter over 6 years ago
Revision 631031e3 | View on GitHub
Fixing a bug where 2 artifacts are used
Without this commit, an exception will occur when one content unit has two or more Artifacts are associated with it.
Added by bmbouter over 6 years ago
Revision 631031e3 | View on GitHub
Fixing a bug where 2 artifacts are used
Without this commit, an exception will occur when one content unit has two or more Artifacts are associated with it.
Added by bmbouter over 6 years ago
Revision ca7925b3 | View on GitHub
Have DeclarativeVersion validate download digests
The downloaders created by DeclarativeVersion didn't specify any digest or size information to use for validation even if it was available. I've tested this code and it worked for me.
Added by bmbouter over 6 years ago
Revision ca7925b3 | View on GitHub
Have DeclarativeVersion validate download digests
The downloaders created by DeclarativeVersion didn't specify any digest or size information to use for validation even if it was available. I've tested this code and it worked for me.
Updated by bmbouter almost 5 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
half-working
https://pulp.plan.io/issues/3844 re #3844