Project

Profile

Help

Story #3844

closed

As a plugin writer, I can use and customize a declarative, concurrent pipeline

Added by bmbouter over 6 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

0%

Estimated time:
Platform Release:
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Sprint:
Sprint 40
Quarter:

Description

Motivation

There are several use cases plugin writers have a hard time fulfilling easily with the current plguin API. There are distinct issues, but could represent a collective opportunity for resolution.

Customization Use Cases

In PR discussion, gmbnomis brought up an example where he needs to do make additional http calls for content units to be newly created during sync. If he is using a declarative interface he specifically isn't trying to determine on his own which content units need to be created versus those that exist. Making these calls later is both inefficient and could lead to correctness problems if a fatal exception is encountered after being saved with partial information.

The ideal functionality would be akin to adding a "step" in the middle.

Related data Use Cases

We've seen several examples of a Content units, e.g. AnsibleRoleVersion, that have ForeignKey relationships to other non-content unit data, e.g. AnsibleRole. During saving newly created AnsibleRoleVersion data may need to be related to existing AnsibleRole data and the generic core machinery doesn't know how to do that. Also different plugins may want different behaviors.

Validation Use Cases

Plugin writer's may want to prevent saving of new content units or Artifacts if they fail certain validation. For example when adding a new Content Unit, e.g. AnsibleRoleVersion, lint checks could be run on it to ensure it's quality meets the requirements.

Stream based end-to-end Use Case

The plugin writer wants to be able to start processing units (downloading, querying the db, saving, etc) without "all units" being available. This should include the downloading and fetching of initial metadata.

Declarative Use Case

To make plugin writer code as easy as possible, having them declare that state of the remote repository and having the core code do the rest is ideal.

Concurrency Use Case

Each of the stream processing steps should be able to be efficiently run concurrently. Also we want this concurrency to mix well with the concurrency already used by the downloaders (asyncio).

Possible Resolution

Use the producer consumer pattern of asyncio to create a linear pipeline of asyncio stages to create a RepositoryVersion from a stream of unsaved content units and unsaved Artifacts. Plugin writers can inject new, custom stages, reorganize/reuse existing ones, or remove stages to get the stream processing they need.

Overall Design Diagram: https://i.imgur.com/7cEXC5e.png

The design has 3 parts in the pulp/pulp PR.

a) The Stages API itself which is effectively the make_pipeline() method
b) All of the stages that are already compatible with the Stages API. This is most of the code
c) DeclarativeVersion, an object which assembles a specific pipeline that can provide both sync_mode='additive/mirror' and lazy mode support without customization.

Core code is here: https://github.com/pulp/pulp/compare/master...bmbouter:introducing-asyncio-stages

The pulp_file code is here: https://github.com/pulp/pulp_file/compare/master...bmbouter:introducing-asyncio-stages

Todo list

  • Tune the pipeline some. the Queue maxsize=100 may be too small.
  • Add a limiter to the Artifact download stage that restricts the number of Artifact downloads in-flight.
  • Update to use the bulk-create updates from https://pulp.plan.io/issues/3814
  • Update to use the bulk-create updates from https://pulp.plan.io/issues/3813
  • Add some docs
  • use aiofiles to move Artifacts into place just before saving since bulk_create won't call save() for each

Related issues

Blocks File Support - Task #3890: Port pulp_file to use DeclarativeVersionCLOSED - CURRENTRELEASEbmbouter

Actions
Actions #1

Updated by bmbouter over 6 years ago

  • Description updated (diff)

Adding some todos

Actions #2

Updated by bmbouter over 6 years ago

  • Sprint Candidate changed from No to Yes
Actions #3

Updated by bmbouter over 6 years ago

  • Description updated (diff)
Actions #4

Updated by bmbouter over 6 years ago

  • Description updated (diff)
Actions #5

Updated by jortel@redhat.com over 6 years ago

  • Groomed changed from No to Yes
Actions #6

Updated by dkliban@redhat.com over 6 years ago

  • Sprint set to Sprint 40
Actions #7

Updated by bmbouter over 6 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
Actions #9

Updated by bmbouter over 6 years ago

  • Blocks Task #3890: Port pulp_file to use DeclarativeVersion added

Added by bmbouter over 6 years ago

Revision 6990563f | View on GitHub

need relative_path at later stage.

just about to refactor

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 6990563f | View on GitHub

need relative_path at later stage.

just about to refactor

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 607e1022 | View on GitHub

Working, but doesn't remove on mirror

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 607e1022 | View on GitHub

Working, but doesn't remove on mirror

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 03473b5c | View on GitHub

sync_mode works and new docstrings

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 03473b5c | View on GitHub

sync_mode works and new docstrings

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 110945dc | View on GitHub

Adds some docs, committing before code changes

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 110945dc | View on GitHub

Adds some docs, committing before code changes

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 6b0f0c62 | View on GitHub

switched last stage to use QuerSet

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 6b0f0c62 | View on GitHub

switched last stage to use QuerSet

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 021de19b | View on GitHub

DeclarativeArtiface and DeclarativeContent objects

They were using namedtuples which were not ideal in two ways:

  1. They could not be modified so updating them with new results from the database was problematic.

  2. They couldn't provide validation for required versus optional fields.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 021de19b | View on GitHub

DeclarativeArtiface and DeclarativeContent objects

They were using namedtuples which were not ideal in two ways:

  1. They could not be modified so updating them with new results from the database was problematic.

  2. They couldn't provide validation for required versus optional fields.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 57b8fbae | View on GitHub

Add transactions and progress reporting

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 57b8fbae | View on GitHub

Add transactions and progress reporting

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 51e26e72 | View on GitHub

Replace sleep() with efficient get()

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 51e26e72 | View on GitHub

Replace sleep() with efficient get()

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 5cb7f0dd | View on GitHub

Make content_association a batch operation

With https://pulp.plan.io/issues/3852 the add_content and remove_content interfaces were switched to using querysets instead of item-by-item usage. This updates DeclarativeVersion and its stages to work with that change.

Overall this makes the content_association stage much more efficient.

This PR also adjusts the data type arriving into the content unit association stage to be DeclarativeVersion.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 5cb7f0dd | View on GitHub

Make content_association a batch operation

With https://pulp.plan.io/issues/3852 the add_content and remove_content interfaces were switched to using querysets instead of item-by-item usage. This updates DeclarativeVersion and its stages to work with that change.

Overall this makes the content_association stage much more efficient.

This PR also adjusts the data type arriving into the content unit association stage to be DeclarativeVersion.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 54c628ce | View on GitHub

Adds max_downloads feature to downloader stage

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 54c628ce | View on GitHub

Adds max_downloads feature to downloader stage

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 4523f5fc | View on GitHub

Introducing FirstStage

gmbnomis suggested some changes, so this PR implements them.

https://github.com/pulp/pulp_file/commit/c23bec069e12a485000f048e486e1d89cce65beb#r29701942

It has a bug in it, which I need to resolve. It also doesn't save the files correctly. I am still making Artifact save compatible with bulk_create()

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 4523f5fc | View on GitHub

Introducing FirstStage

gmbnomis suggested some changes, so this PR implements them.

https://github.com/pulp/pulp_file/commit/c23bec069e12a485000f048e486e1d89cce65beb#r29701942

It has a bug in it, which I need to resolve. It also doesn't save the files correctly. I am still making Artifact save compatible with bulk_create()

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision afbaaefc | View on GitHub

files now save in the backend with bulk_create

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision afbaaefc | View on GitHub

files now save in the backend with bulk_create

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 642e47f6 | View on GitHub

Ok now it's really working for the second sync too.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 642e47f6 | View on GitHub

Ok now it's really working for the second sync too.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision c0ceaeb6 | View on GitHub

Fixes a bug caused by using ints not uuids

autoincrement fields cause you to have to get the db-assigned ids from the database after the first bulk_create(). Otherwise the RemoteContent.content_artifact attributes will be None.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision c0ceaeb6 | View on GitHub

Fixes a bug caused by using ints not uuids

autoincrement fields cause you to have to get the db-assigned ids from the database after the first bulk_create(). Otherwise the RemoteContent.content_artifact attributes will be None.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 0a5ac52d | View on GitHub

Do away with the closures

This replaces the closures with classes that can hold the required values. Closure's aren't entirely needed here.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 0a5ac52d | View on GitHub

Do away with the closures

This replaces the closures with classes that can hold the required values. Closure's aren't entirely needed here.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 304dedeb | View on GitHub

Breaking up huge module into a package of modules

All of the user facing plugin-API things are importable from pulpcore.plugin.stages.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 304dedeb | View on GitHub

Breaking up huge module into a package of modules

All of the user facing plugin-API things are importable from pulpcore.plugin.stages.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 0d16d8a6 | View on GitHub

Some fixes related to file handling

  • redo save() to inspect .pks instead
  • replace the usage of shutil.move() with django's storage backend save() method.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 0d16d8a6 | View on GitHub

Some fixes related to file handling

  • redo save() to inspect .pks instead
  • replace the usage of shutil.move() with django's storage backend save() method.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 082e1e48 | View on GitHub

Updating stages w/ some clearer names

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 082e1e48 | View on GitHub

Updating stages w/ some clearer names

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 34c8536b | View on GitHub

Various fixes from review

Mostly fixes class names for pep8 compliance

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 34c8536b | View on GitHub

Various fixes from review

Mostly fixes class names for pep8 compliance

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision bc6d425e | View on GitHub

All stages are classes now

This is a big diff, but mainly due only to indention

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision bc6d425e | View on GitHub

All stages are classes now

This is a big diff, but mainly due only to indention

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 579c7004 | View on GitHub

Final round of fixes

  • switch storage backend to DefaultStorage()
  • super() is now called first in anytime it's used
  • FirstStage is removed
  • BaseStage -> Stage
  • the relative path now handles duplicate rel_paths correctly
  • documenting args and kwargs

I did some final hand testing, and I also ensure the docs build and look good.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 579c7004 | View on GitHub

Final round of fixes

  • switch storage backend to DefaultStorage()
  • super() is now called first in anytime it's used
  • FirstStage is removed
  • BaseStage -> Stage
  • the relative path now handles duplicate rel_paths correctly
  • documenting args and kwargs

I did some final hand testing, and I also ensure the docs build and look good.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 631031e3 | View on GitHub

Fixing a bug where 2 artifacts are used

Without this commit, an exception will occur when one content unit has two or more Artifacts are associated with it.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision 631031e3 | View on GitHub

Fixing a bug where 2 artifacts are used

Without this commit, an exception will occur when one content unit has two or more Artifacts are associated with it.

https://pulp.plan.io/issues/3844 re #3844

Actions #10

Updated by bmbouter over 6 years ago

  • Status changed from POST to MODIFIED

This is merged.

Added by bmbouter over 6 years ago

Revision ca7925b3 | View on GitHub

Have DeclarativeVersion validate download digests

The downloaders created by DeclarativeVersion didn't specify any digest or size information to use for validation even if it was available. I've tested this code and it worked for me.

https://pulp.plan.io/issues/3844 re #3844

Added by bmbouter over 6 years ago

Revision ca7925b3 | View on GitHub

Have DeclarativeVersion validate download digests

The downloaders created by DeclarativeVersion didn't specify any digest or size information to use for validation even if it was available. I've tested this code and it worked for me.

https://pulp.plan.io/issues/3844 re #3844

Actions #11

Updated by daviddavis over 5 years ago

  • Sprint/Milestone set to 3.0.0
Actions #12

Updated by bmbouter over 5 years ago

  • Tags deleted (Pulp 3)
Actions #13

Updated by bmbouter about 5 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Also available in: Atom PDF