Story #3844
As a plugin writer, I can use and customize a declarative, concurrent pipeline
0%
Description
Motivation¶
There are several use cases plugin writers have a hard time fulfilling easily with the current plguin API. There are distinct issues, but could represent a collective opportunity for resolution.
Customization Use Cases¶
In PR discussion, gmbnomis brought up an example where he needs to do make additional http calls for content units to be newly created during sync. If he is using a declarative interface he specifically isn't trying to determine on his own which content units need to be created versus those that exist. Making these calls later is both inefficient and could lead to correctness problems if a fatal exception is encountered after being saved with partial information.
The ideal functionality would be akin to adding a "step" in the middle.
Related data Use Cases¶
We've seen several examples of a Content units, e.g. AnsibleRoleVersion, that have ForeignKey relationships to other non-content unit data, e.g. AnsibleRole. During saving newly created AnsibleRoleVersion data may need to be related to existing AnsibleRole data and the generic core machinery doesn't know how to do that. Also different plugins may want different behaviors.
Validation Use Cases¶
Plugin writer's may want to prevent saving of new content units or Artifacts if they fail certain validation. For example when adding a new Content Unit, e.g. AnsibleRoleVersion, lint checks could be run on it to ensure it's quality meets the requirements.
Stream based end-to-end Use Case¶
The plugin writer wants to be able to start processing units (downloading, querying the db, saving, etc) without "all units" being available. This should include the downloading and fetching of initial metadata.
Declarative Use Case¶
To make plugin writer code as easy as possible, having them declare that state of the remote repository and having the core code do the rest is ideal.
Concurrency Use Case¶
Each of the stream processing steps should be able to be efficiently run concurrently. Also we want this concurrency to mix well with the concurrency already used by the downloaders (asyncio).
Possible Resolution¶
Use the producer consumer pattern of asyncio to create a linear pipeline of asyncio stages to create a RepositoryVersion from a stream of unsaved content units and unsaved Artifacts. Plugin writers can inject new, custom stages, reorganize/reuse existing ones, or remove stages to get the stream processing they need.
Overall Design Diagram: https://i.imgur.com/7cEXC5e.png
The design has 3 parts in the pulp/pulp PR.
a) The Stages API itself which is effectively the make_pipeline()
method
b) All of the stages that are already compatible with the Stages API. This is most of the code
c) DeclarativeVersion, an object which assembles a specific pipeline that can provide both sync_mode='additive/mirror' and lazy mode support without customization.
Core code is here: https://github.com/pulp/pulp/compare/master...bmbouter:introducing-asyncio-stages
The pulp_file code is here: https://github.com/pulp/pulp_file/compare/master...bmbouter:introducing-asyncio-stages
Todo list¶
- Tune the pipeline some. the Queue maxsize=100 may be too small.
- Add a limiter to the Artifact download stage that restricts the number of Artifact downloads in-flight.
- Update to use the bulk-create updates from https://pulp.plan.io/issues/3814
- Update to use the bulk-create updates from https://pulp.plan.io/issues/3813
- Add some docs
- use aiofiles to move Artifacts into place just before saving since bulk_create won't call save() for each
Related issues
Associated revisions
Revision 14346421
View on GitHub
half-working
Revision 613f6b6a
View on GitHub
Working, but not bulk_create
Revision 613f6b6a
View on GitHub
Working, but not bulk_create
Revision 789e6f41
View on GitHub
bulk_create does not work
Revision 789e6f41
View on GitHub
bulk_create does not work
Revision 607e1022
View on GitHub
Working, but doesn't remove on mirror
Revision 607e1022
View on GitHub
Working, but doesn't remove on mirror
Revision 2d34c92d
View on GitHub
Last Stage working
Revision 2d34c92d
View on GitHub
Last Stage working
Revision fb03774a
View on GitHub
Moving docs fixes to its own PR
Revision fb03774a
View on GitHub
Moving docs fixes to its own PR
Revision 03473b5c
View on GitHub
sync_mode works and new docstrings
Revision 03473b5c
View on GitHub
sync_mode works and new docstrings
Revision 110945dc
View on GitHub
Adds some docs, committing before code changes
Revision 110945dc
View on GitHub
Adds some docs, committing before code changes
Revision 6b0f0c62
View on GitHub
switched last stage to use QuerSet
Revision 6b0f0c62
View on GitHub
switched last stage to use QuerSet
Revision 56842b8c
View on GitHub
more docstrings
Revision 56842b8c
View on GitHub
more docstrings
Revision 021de19b
View on GitHub
DeclarativeArtiface and DeclarativeContent objects
They were using namedtuples which were not ideal in two ways:
-
They could not be modified so updating them with new results from the database was problematic.
-
They couldn't provide validation for required versus optional fields.
Revision 021de19b
View on GitHub
DeclarativeArtiface and DeclarativeContent objects
They were using namedtuples which were not ideal in two ways:
-
They could not be modified so updating them with new results from the database was problematic.
-
They couldn't provide validation for required versus optional fields.
Revision 57b8fbae
View on GitHub
Add transactions and progress reporting
Revision 57b8fbae
View on GitHub
Add transactions and progress reporting
Revision c115a96a
View on GitHub
finished up docstrings
Revision c115a96a
View on GitHub
finished up docstrings
Revision 51e26e72
View on GitHub
Replace sleep() with efficient get()
Revision 51e26e72
View on GitHub
Replace sleep() with efficient get()
Revision dc2dad8a
View on GitHub
docstring the progress reporting
Revision dc2dad8a
View on GitHub
docstring the progress reporting
Revision 5cb7f0dd
View on GitHub
Make content_association a batch operation
With https://pulp.plan.io/issues/3852 the add_content and remove_content interfaces were switched to using querysets instead of item-by-item usage. This updates DeclarativeVersion and its stages to work with that change.
Overall this makes the content_association stage much more efficient.
This PR also adjusts the data type arriving into the content unit association stage to be DeclarativeVersion.
Revision 5cb7f0dd
View on GitHub
Make content_association a batch operation
With https://pulp.plan.io/issues/3852 the add_content and remove_content interfaces were switched to using querysets instead of item-by-item usage. This updates DeclarativeVersion and its stages to work with that change.
Overall this makes the content_association stage much more efficient.
This PR also adjusts the data type arriving into the content unit association stage to be DeclarativeVersion.
Revision 54c628ce
View on GitHub
Adds max_downloads feature to downloader stage
Revision 54c628ce
View on GitHub
Adds max_downloads feature to downloader stage
Revision 4523f5fc
View on GitHub
Introducing FirstStage
gmbnomis suggested some changes, so this PR implements them.
https://github.com/pulp/pulp_file/commit/c23bec069e12a485000f048e486e1d89cce65beb#r29701942
It has a bug in it, which I need to resolve. It also doesn't save the files correctly. I am still making Artifact save compatible with bulk_create()
Revision 4523f5fc
View on GitHub
Introducing FirstStage
gmbnomis suggested some changes, so this PR implements them.
https://github.com/pulp/pulp_file/commit/c23bec069e12a485000f048e486e1d89cce65beb#r29701942
It has a bug in it, which I need to resolve. It also doesn't save the files correctly. I am still making Artifact save compatible with bulk_create()
Revision 3fdd849e
View on GitHub
Fixes flake8 errors
Revision 3fdd849e
View on GitHub
Fixes flake8 errors
Revision afbaaefc
View on GitHub
files now save in the backend with bulk_create
Revision afbaaefc
View on GitHub
files now save in the backend with bulk_create
Revision 784e7788
View on GitHub
Fixed bug, it works again!
Revision 784e7788
View on GitHub
Fixed bug, it works again!
Revision 642e47f6
View on GitHub
Ok now it's really working for the second sync too.
Revision 642e47f6
View on GitHub
Ok now it's really working for the second sync too.
Revision 51330150
View on GitHub
Removing accidental docs change
Revision 51330150
View on GitHub
Removing accidental docs change
Revision c0ceaeb6
View on GitHub
Fixes a bug caused by using ints not uuids
autoincrement fields cause you to have to get the db-assigned ids from the database after the first bulk_create(). Otherwise the RemoteContent.content_artifact attributes will be None.
Revision c0ceaeb6
View on GitHub
Fixes a bug caused by using ints not uuids
autoincrement fields cause you to have to get the db-assigned ids from the database after the first bulk_create(). Otherwise the RemoteContent.content_artifact attributes will be None.
Revision 9ddbb3c6
View on GitHub
flake8 fixes
Revision 9ddbb3c6
View on GitHub
flake8 fixes
Revision 0a5ac52d
View on GitHub
Do away with the closures
This replaces the closures with classes that can hold the required values. Closure's aren't entirely needed here.
Revision 0a5ac52d
View on GitHub
Do away with the closures
This replaces the closures with classes that can hold the required values. Closure's aren't entirely needed here.
Revision 304dedeb
View on GitHub
Breaking up huge module into a package of modules
All of the user facing plugin-API things are importable from pulpcore.plugin.stages.
Revision 304dedeb
View on GitHub
Breaking up huge module into a package of modules
All of the user facing plugin-API things are importable from pulpcore.plugin.stages.
Revision 0d16d8a6
View on GitHub
Some fixes related to file handling
- redo save() to inspect .pks instead
- replace the usage of shutil.move() with django's storage backend save() method.
Revision 0d16d8a6
View on GitHub
Some fixes related to file handling
- redo save() to inspect .pks instead
- replace the usage of shutil.move() with django's storage backend save() method.
Revision 55553cf3
View on GitHub
Adds docs for the stages package
Revision 55553cf3
View on GitHub
Adds docs for the stages package
Revision 05bb5f86
View on GitHub
Lots of docs fixes
Revision 05bb5f86
View on GitHub
Lots of docs fixes
Revision 082e1e48
View on GitHub
Updating stages w/ some clearer names
Revision 082e1e48
View on GitHub
Updating stages w/ some clearer names
Revision 34c8536b
View on GitHub
Various fixes from review
Mostly fixes class names for pep8 compliance
Revision 34c8536b
View on GitHub
Various fixes from review
Mostly fixes class names for pep8 compliance
Revision bc6d425e
View on GitHub
All stages are classes now
This is a big diff, but mainly due only to indention
Revision bc6d425e
View on GitHub
All stages are classes now
This is a big diff, but mainly due only to indention
Revision 856d090e
View on GitHub
flake8 fix
Revision 856d090e
View on GitHub
flake8 fix
Revision 579c7004
View on GitHub
Final round of fixes
- switch storage backend to DefaultStorage()
- super() is now called first in anytime it's used
- FirstStage is removed
- BaseStage -> Stage
- the relative path now handles duplicate rel_paths correctly
- documenting args and kwargs
I did some final hand testing, and I also ensure the docs build and look good.
Revision 579c7004
View on GitHub
Final round of fixes
- switch storage backend to DefaultStorage()
- super() is now called first in anytime it's used
- FirstStage is removed
- BaseStage -> Stage
- the relative path now handles duplicate rel_paths correctly
- documenting args and kwargs
I did some final hand testing, and I also ensure the docs build and look good.
Revision 631031e3
View on GitHub
Fixing a bug where 2 artifacts are used
Without this commit, an exception will occur when one content unit has two or more Artifacts are associated with it.
Revision 631031e3
View on GitHub
Fixing a bug where 2 artifacts are used
Without this commit, an exception will occur when one content unit has two or more Artifacts are associated with it.
Revision 4c3b43a6
View on GitHub
Fixes string comparison
Revision 4c3b43a6
View on GitHub
Fixes string comparison
Revision ca7925b3
View on GitHub
Have DeclarativeVersion validate download digests
The downloaders created by DeclarativeVersion didn't specify any digest or size information to use for validation even if it was available. I've tested this code and it worked for me.
Revision ca7925b3
View on GitHub
Have DeclarativeVersion validate download digests
The downloaders created by DeclarativeVersion didn't specify any digest or size information to use for validation even if it was available. I've tested this code and it worked for me.
History
#2
Updated by bmbouter over 2 years ago
- Sprint Candidate changed from No to Yes
#3
Updated by bmbouter over 2 years ago
- Description updated (diff)
#4
Updated by bmbouter over 2 years ago
- Description updated (diff)
#5
Updated by jortel@redhat.com over 2 years ago
- Groomed changed from No to Yes
#6
Updated by dkliban@redhat.com over 2 years ago
- Sprint set to Sprint 40
#7
Updated by bmbouter over 2 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to bmbouter
#8
Updated by bmbouter over 2 years ago
- Status changed from ASSIGNED to POST
#9
Updated by bmbouter over 2 years ago
- Blocks Task #3890: Port pulp_file to use DeclarativeVersion added
#11
Updated by daviddavis almost 2 years ago
- Sprint/Milestone set to 3.0.0
#12
Updated by bmbouter over 1 year ago
- Tags deleted (
Pulp 3)
#13
Updated by bmbouter about 1 year ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Please register to edit this issue
half-working
https://pulp.plan.io/issues/3844 re #3844