Refactor #4132 (closed)

Metadata is not downloaded in parallel

Added by amacdona@redhat.com about 6 years ago. Updated about 4 years ago.

Status: CLOSED - NOTABUG
Priority: High
Assignee: -
Sprint/Milestone:
Start date:
Due date:
% Done: 0%
Estimated time:
Platform Release:
Target Release - Python:
Groomed: Yes
Sprint Candidate: No
Tags:
Sprint:
Quarter:

Description

Metadata for the python plugin is all retrieved in the PythonFirstStage[0]. Project-level metadata is downloaded in a for loop[3]. Since we are using asyncio, each `await` cedes control while a metadata file is downloading, so the later pulpcore stages keep running; however, after the first metadata file is downloaded and processed, the PythonFirstStage still downloads only 1 metadata file at a time.

My assumption is that downloading the project-level metadata in parallel would be a relatively small performance improvement when each project (each corresponding to 1 metadata file) has many Python distributions associated with it. However, for lazy sync, or for cases where each project has only a small number of Python distributions, downloading the metadata in parallel could be a very large performance increase.
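
To make the difference concrete, here is a minimal asyncio sketch of the two patterns. `downloader_factory` and its `run()` coroutine are illustrative stand-ins for the plugin's downloaders, not actual plugin code:

    import asyncio

    # Current pattern (roughly): each project's metadata is awaited inside
    # the loop, so only one download is ever in flight at a time.
    async def fetch_serially(downloader_factory, project_urls):
        results = []
        for url in project_urls:
            # cedes control to other stages, but serializes the downloads
            results.append(await downloader_factory(url).run())
        return results

    # Parallel alternative: schedule every download up front and gather the
    # results, so all of the metadata files are in flight concurrently.
    async def fetch_in_parallel(downloader_factory, project_urls):
        return await asyncio.gather(
            *[downloader_factory(url).run() for url in project_urls]
        )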

Background

Unless a Stage implements parallel calls with asyncio, each Stage only operates on 1 item at a time (or 1 batch at a time). The only pulpcore Stage that runs multiple calls in parallel is the ArtifactDownloader[1], which uses the ArtifactDownloaderRunner to `asyncio.ensure_future` each download[2]; this is what enables many calls to happen simultaneously.
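
As a rough illustration of that pattern (not the actual ArtifactDownloaderRunner code), scheduling coroutines with asyncio.ensure_future and harvesting them as they complete looks something like this; the `max_concurrent` bound is an assumption for the sketch:

    import asyncio

    # Sketch of the ensure_future pattern: wrap each download coroutine in a
    # task so they all run concurrently, bounded by a semaphore, and handle
    # results in completion order.
    async def run_downloads(download_coroutines, max_concurrent=20):
        semaphore = asyncio.Semaphore(max_concurrent)

        async def bounded(coroutine):
            async with semaphore:
                return await coroutine

        tasks = [asyncio.ensure_future(bounded(c)) for c in download_coroutines]
        results = []
        for task in asyncio.as_completed(tasks):
            results.append(await task)  # downloads finish in arbitrary order
        return results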

Design Options

We can download the metadata in parallel in at least 2 ways.

  1. Split up the PythonFirstStage. Even though the Python plugin doesn't have a "Project" Content unit, we can still create DeclarativeContent objects that can flow to later stages.
    1. ProjectListStage - Rather than downloading metadata in the first stage[3], we could create a DeclarativeContent object for each project, and out_q.put(dc).
    2. ArtifactDownloader (from core, unchanged) - This would download (in parallel) each of the project metadata files.
    3. ProcessProjectMetadata - This Stage would open and read the project dc.d_artifact file, create dcs for PythonPackageContent, and out_q.put(python_package_content_dc). It would not continue to pass project dcs down the pipeline. This would roughly correspond to part of the first stage[4], but would also need to include package filtering[5].
    4. Suggested Pipeline: ProjectListStage -> ArtifactDownloader -> ProcessProjectMetadata -> ArtifactDownloader -> save artifacts, save content, etc. (a sketch of the two new stages follows this list)
  2. It's probably also possible to implement parallel downloads directly in a monolithic first stage by using asyncio.ensure_future, but IMO this would not be an idiomatic use of the Stages API. It would be complex and would require knowledge of asyncio that is not really expected of plugin writers. Besides, the tricky part of this is already implemented by the ArtifactDownloader.
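
A rough sketch of the two new stages from option 1, written against the in_q/out_q stage interface described above. PythonProjectMetadata (a throwaway content unit for projects) and parse_packages() are hypothetical placeholders, and the DeclarativeArtifact/DeclarativeContent arguments are assumptions about the pulpcore API of this era:

    import json

    from pulpcore.plugin.models import Artifact
    from pulpcore.plugin.stages import DeclarativeArtifact, DeclarativeContent

    class ProjectListStage:
        """Emit one project dc per project name; nothing is downloaded here."""

        def __init__(self, remote, project_names):
            self.remote = remote
            self.project_names = project_names

        async def __call__(self, in_q, out_q):
            for name in self.project_names:
                url = "{}/pypi/{}/json".format(self.remote.url, name)
                da = DeclarativeArtifact(
                    artifact=Artifact(),
                    url=url,
                    relative_path="{}.json".format(name),
                    remote=self.remote,
                )
                # PythonProjectMetadata is hypothetical; the plugin has no
                # "Project" unit, but a dc still lets the metadata flow to
                # the ArtifactDownloader stage.
                dc = DeclarativeContent(
                    content=PythonProjectMetadata(name=name),
                    d_artifacts=[da],
                )
                await out_q.put(dc)
            await out_q.put(None)  # signal end-of-stream to the next stage

    class ProcessProjectMetadata:
        """Read downloaded project metadata; emit PythonPackageContent dcs."""

        async def __call__(self, in_q, out_q):
            while True:
                dc = await in_q.get()
                if dc is None:
                    break
                if isinstance(dc.content, PythonProjectMetadata):
                    path = dc.d_artifacts[0].artifact.file.path
                    with open(path) as fp:
                        metadata = json.load(fp)
                    # parse_packages() stands in for the existing parsing
                    # and package-filtering logic; project dcs stop here
                    # rather than flowing further down the pipeline.
                    for package_dc in parse_packages(metadata):
                        await out_q.put(package_dc)
                else:
                    await out_q.put(dc)
            await out_q.put(None)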

[0]: https://github.com/pulp/pulp_python/blob/master/pulp_python/app/tasks/sync.py#L57
[1]: https://github.com/pulp/pulp/blob/master/plugin/pulpcore/plugin/stages/artifact_stages.py#L215
[2]: https://github.com/pulp/pulp/blob/master/plugin/pulpcore/plugin/stages/artifact_stages.py#L121-L132
[3]: https://github.com/pulp/pulp_python/blob/master/pulp_python/app/tasks/sync.py#L96
[4]: https://github.com/pulp/pulp_python/blob/master/pulp_python/app/tasks/sync.py#L121-L129
[5]: https://github.com/pulp/pulp_python/blob/master/pulp_python/app/tasks/sync.py#L151


Related issues

Related to Python Support - Refactor #6930: Use Bandersnatch to perform package metadata fetching and filtering (MODIFIED, assignee: gerrod)

Actions #1

Updated by CodeHeeler about 6 years ago

  • Tracker changed from Issue to Refactor
  • % Done set to 0
Actions #2

Updated by amacdona@redhat.com about 6 years ago

  • Sprint Candidate changed from No to Yes
  • Tags Pulp 3 added
Actions #3

Updated by amacdona@redhat.com about 6 years ago

  • Blocks Issue #1183: As a developer, I can close this ticket as a duplicate of 1884 :) added
Actions #4

Updated by dalley about 6 years ago

  • Groomed changed from No to Yes
Actions #5

Updated by amacdona@redhat.com about 6 years ago

  • Blocks Story #1884: As a user, I can lazily sync python packages added
Actions #6

Updated by amacdona@redhat.com about 6 years ago

  • Blocks deleted (Issue #1183: As a developer, I can close this ticket as a duplicate of 1884 :))
Actions #7

Updated by rchan almost 6 years ago

  • Sprint Candidate changed from Yes to No

Remove sprint candidate flag. @Dana will keep this in mind as a good task for learning when no other higher-priority work is on the sprint.

Actions #8

Updated by amacdona@redhat.com over 5 years ago

  • Blocks deleted (Story #1884: As a user, I can lazily sync python packages)
Actions #9

Updated by amacdona@redhat.com over 5 years ago

Instead of the refactor proposed here, I think we can just alter these downloads to happen during the first stage, but in parallel. If we do it that way, this no longer needs to block #1884.
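
A minimal sketch of that idea, keeping the single first stage but scheduling all the metadata downloads at once. get_project_metadata() and process_metadata() are hypothetical stand-ins for the stage's existing per-project download-and-parse logic:

    import asyncio

    async def fetch_all_metadata(stage, project_names):
        # Schedule every project's metadata download as its own task, then
        # process each one as soon as it finishes.
        tasks = [
            asyncio.ensure_future(stage.get_project_metadata(name))
            for name in project_names
        ]
        for task in asyncio.as_completed(tasks):
            metadata = await task
            stage.process_metadata(metadata)  # hypothetical handler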

Actions #10

Updated by amacdona@redhat.com over 5 years ago

  • Sprint/Milestone set to 3.0 GA
Actions #11

Updated by bmbouter over 5 years ago

  • Tags deleted (Pulp 3)
Actions #12

Updated by dalley over 4 years ago

  • Priority changed from Normal to High
Actions #13

Updated by dalley over 4 years ago

  • Related to Refactor #6930: Use Bandersnatch to perform package metadata fetching and filtering added
Actions #14

Updated by dalley about 4 years ago

  • Status changed from NEW to CLOSED - NOTABUG
