Story #3844

Updated by bmbouter over 5 years ago

h2. Motivation 

There are several use cases plugin writers have a hard time fulfilling easily with the current plugin API. They are distinct issues, but together they represent an opportunity for a collective resolution.


 h3. Customization Use Cases 

In PR discussion, @bmgnomis brought up "an example":https://github.com/pulp/pulp/pull/3483#issuecomment-390170199 where he needs to make additional HTTP calls for content units that will be newly created during sync. With a declarative interface he specifically isn't trying to determine on his own which content units need to be created versus which already exist. Making these calls later is both inefficient and could lead to correctness problems if a fatal exception is encountered after units have been saved with partial information.

 The ideal functionality would be akin to adding a "step" in the middle. 


 h3. Related data Use Cases 

We've seen several examples of Content units, e.g. AnsibleRoleVersion, that have ForeignKey relationships to other non-content-unit data, e.g. AnsibleRole. During saving, newly created AnsibleRoleVersion data may need to be related to existing AnsibleRole data, and the generic core machinery doesn't know how to do that. Also, different plugins may want different behaviors.
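
For illustration only, here is a minimal Django-style sketch of this kind of relationship; the field names and the import path are assumptions, not the actual pulp_ansible models:

<pre><code class="python">
from django.db import models

from pulpcore.plugin.models import Content  # assumed import path


class AnsibleRole(models.Model):
    """Non-content-unit data shared by many versions (hypothetical sketch)."""
    namespace = models.CharField(max_length=64)
    name = models.CharField(max_length=64)


class AnsibleRoleVersion(Content):
    """A content unit that points at its parent role via a ForeignKey."""
    version = models.CharField(max_length=128)
    role = models.ForeignKey(AnsibleRole, on_delete=models.CASCADE, related_name='versions')
</code></pre>

The open question for the core machinery is which stage gets to look up or create the AnsibleRole row before the AnsibleRoleVersion is saved, since only the plugin knows that relationship exists.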


 h3. Validation Use Cases 

Plugin writers may want to prevent saving of new content units or Artifacts if they fail certain validation. For example, when adding a new Content unit, e.g. AnsibleRoleVersion, lint checks could be run on it to ensure its quality meets the requirements.


h3. Stream-based end-to-end Use Case

The plugin writer wants to be able to start processing units (downloading, querying the db, saving, etc.) without "all units" being available. This should include the downloading and fetching of the initial metadata.


 h3. Declarative Use Case 

To make plugin writer code as easy as possible, it is ideal to have them declare the state of the remote repository and have the core code do the rest.


 h3. Concurrency Use Case 

Each of the stream processing steps should be able to run concurrently and efficiently. We also want this concurrency to mix well with the concurrency already used by the downloaders (asyncio).
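
As a rough illustration of how stage concurrency can mix with asyncio-based downloading (a generic sketch, not code from the branches below; the downloader is a placeholder), a single stage can keep a bounded number of downloads in flight while still streaming results onward:

<pre><code class="python">
import asyncio


async def fake_download(unit):
    """Placeholder for the real asyncio downloader (assumption)."""
    await asyncio.sleep(0)


async def artifact_download_stage(in_q, out_q, max_in_flight=20):
    """Read units from in_q, download their Artifacts concurrently, pass units on."""
    semaphore = asyncio.Semaphore(max_in_flight)
    tasks = []

    async def bounded(unit):
        async with semaphore:
            await fake_download(unit)
        await out_q.put(unit)

    while True:
        unit = await in_q.get()
        if unit is None:  # None marks the end of the stream
            break
        tasks.append(asyncio.ensure_future(bounded(unit)))
    await asyncio.gather(*tasks)
    await out_q.put(None)
</code></pre>

This is also roughly the shape a download limiter (see the Todo list below) could take.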


 h2. Possible Resolution 

Use the producer-consumer pattern of asyncio to create a linear pipeline of asyncio stages that creates a RepositoryVersion from a stream of unsaved content units and unsaved Artifacts. Plugin writers can inject new, custom stages, reorganize/reuse existing ones, or remove stages to get the stream processing they need.
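
A minimal sketch of that wiring using plain asyncio primitives; <code>make_pipeline</code> here is an assumption about the shape of the API, not the code from the branches below:

<pre><code class="python">
import asyncio


async def make_pipeline(*stage_fns, maxsize=100):
    """Connect each `async def stage(in_q, out_q)` with bounded queues and run them all.

    The first stage ignores its in_q and produces the initial stream of unsaved units.
    """
    queues = [asyncio.Queue(maxsize=maxsize) for _ in range(len(stage_fns) + 1)]
    stages = [fn(in_q, out_q) for fn, in_q, out_q in zip(stage_fns, queues, queues[1:])]
    await asyncio.gather(*stages)


async def my_custom_stage(in_q, out_q):
    """A stage a plugin writer could inject "in the middle", e.g. to make the
    extra HTTP calls or run lint checks before anything is saved."""
    while True:
        unit = await in_q.get()
        if unit is None:  # None marks the end of the stream
            await out_q.put(None)
            break
        # enrich or validate `unit` here; raising aborts the sync before saving
        await out_q.put(unit)
</code></pre>

Because every stage has the same <code>(in_q, out_q)</code> shape, injecting, removing, or reordering stages is just a matter of changing the list passed to the pipeline, which covers the customization and validation use cases above.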

 Overall Design Diagram:    https://i.imgur.com/7cEXC5e.png 

 The design has 3 parts in the pulp/pulp PR. 

a) The Stages API itself, which is effectively the <code>make_pipeline()</code> method.
b) All of the stages that are already compatible with the Stages API. This is most of the code.
c) DeclarativeVersion, an object which assembles a specific pipeline that can provide both sync_mode='additive/mirror' and lazy mode support without customization (see the sketch below).
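
From the plugin writer's side, part (c) means most plugins only write the first stage and let the core run the rest. Here is a self-contained toy version of that shape; the names are made up, and the real signatures are in the branches below:

<pre><code class="python">
import asyncio


async def file_first_stage(in_q, out_q):
    """Stand-in for the plugin's first stage: parse the remote metadata and emit
    unsaved content units into the pipeline. in_q is unused for the first stage."""
    for entry in ('PULP_MANIFEST entry 1', 'PULP_MANIFEST entry 2'):
        await out_q.put(entry)
    await out_q.put(None)


async def save_and_add_stage(in_q, out_q):
    """Stand-in for the core-provided tail of the pipeline; out_q is unused."""
    while True:
        unit = await in_q.get()
        if unit is None:
            break
        print('would save and add to the new RepositoryVersion:', unit)


async def main():
    q_in, q_mid, q_out = (asyncio.Queue(maxsize=100) for _ in range(3))
    await asyncio.gather(file_first_stage(q_in, q_mid), save_and_add_stage(q_mid, q_out))


asyncio.run(main())
</code></pre>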

 Core code is here:    https://github.com/pulp/pulp/compare/master...bmbouter:introducing-asyncio-stages 

 The pulp_file code is here: https://github.com/pulp/pulp_file/compare/master...bmbouter:introducing-asyncio-stages 

 h2. Todo list 

* Tune the pipeline some; the Queue maxsize=100 may be too small. 
 * Add a limiter to the Artifact download stage that restricts the number of Artifact downloads in-flight. 
 * Update to use the bulk-create updates from https://pulp.plan.io/issues/3814 
 * Update to use the bulk-create updates from https://pulp.plan.io/issues/3813 
 * Add some docs
