Task #2883

Updated by bizhang over 2 years ago


A content model, content serializer and content ViewSet will have been already created by https://pulp.plan.io/issues/2882

This task is to finish those classes, adding any Python specific fields.

This task will be complete when a django shell user can CRUD full representations of Python Package Releases. A REST API user should be able to read a list of all Python units at `/v3/content/python/` as well as retrieve data on a specific unit (URL is not yet decided).

All unit metadata is provided by the shell user at this point. It is not expected that the plugin extract the metadata from a package or scrape it from upstream.

h2. Content Model

After discussion we will go with the Python distribution package as the content unit model.

Two ways of modeling were discussed; both are described below. For compactness I'm going to call the metadata fields beyond filename, name, version, metadata_version, packagetype and path _additional metadata_.

The PackageContent (because it's not really a PythonContent, and DistributionContent would overload the term 'distribution' too much) would contain the following fields:

|filename (primary key) |
|packagetype |
|path |
|name |
|version|
|metadata_version|
|summary|
|Description|
|Keywords |
|Home-page|
|Download-URL|
|Author|
|Author-email|
|Maintainer |
|Maintainer-email|
|License |
|Classifier |
|Requires-Python|
|Project-URL |
|platform|
|download_url|
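
As a rough illustration of the field list above, here is a minimal sketch using a plain frozen dataclass. It is a stand-in only, not the Django model this task asks for: the class layout, the snake_case attribute names, the empty-string defaults, and the folding of Download-URL and download_url into one attribute are all assumptions, and the path value in the usage lines is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a content unit is not edited after creation
class PackageContent:
    """Hypothetical stand-in for the content unit; not the real Django model."""

    # Identity and file-level fields
    filename: str  # acts as the primary key
    packagetype: str  # e.g. "sdist" or "bdist_wheel"
    path: str
    name: str
    version: str
    metadata_version: str

    # Additional metadata; optional here, supplied by the shell user
    summary: str = ""
    description: str = ""
    keywords: str = ""
    home_page: str = ""
    download_url: str = ""
    author: str = ""
    author_email: str = ""
    maintainer: str = ""
    maintainer_email: str = ""
    license: str = ""
    classifier: str = ""
    requires_python: str = ""
    project_url: str = ""
    platform: str = ""


# Usage: all values are provided by hand; nothing is extracted from the package.
unit = PackageContent(
    filename="scipy-0.9.0.tar.gz",
    packagetype="sdist",
    path="scipy/scipy-0.9.0.tar.gz",  # illustrative path only
    name="scipy",
    version="0.9.0",
    metadata_version="1.1",
)
```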

h3. Python Distribution Package as Content Unit


This is the way Pulp 2 is modeled currently. A content unit would look like:
|filename (primary key)|
|name |
|version |
|metadata_version|
|packagetype |
|path |
|additional metadata |

Each content unit would contain one artifact, corresponding to the distribution package with that filename on PyPI.

h4. Disadvantages

The disadvantage of modeling a Python distribution package as a content unit is that the distribution package is not what the user cares most about. We would have multiple content units for the same release, one per target system or format, e.g.:
scipy-0.9.0-cp26-cp26mu-manylinux1_x86_64.whl
scipy-0.9.0-cp27-cp27m-manylinux1_x86_64.whl
scipy-0.9.0-cp27-cp27mu-manylinux1_x86_64.whl
scipy-0.9.0.tar.gz
scipy-0.9.0.zip

As a user I do not want to view all these distribution packages when I query a repository. The only thing I would care about is the release, and I will let pip take care of which distribution package to install. PyPI in particular makes the release a first class citizen instead of the distribution packages.
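
To make the "one release, many distribution packages" point concrete, here is a small sketch that groups the filenames above by release. The helper names `release_key` and `group_by_release` are hypothetical, and the parsing is deliberately simplified: it assumes neither the project name nor the version contains a hyphen, which holds for these examples but not for every sdist.

```python
from collections import defaultdict

SDIST_SUFFIXES = (".tar.gz", ".zip")


def release_key(filename):
    """Return (name, version) for a wheel or sdist filename (simplified parsing)."""
    if filename.endswith(".whl"):
        stem = filename[: -len(".whl")]
    else:
        for suffix in SDIST_SUFFIXES:
            if filename.endswith(suffix):
                stem = filename[: -len(suffix)]
                break
        else:
            raise ValueError(f"unrecognized distribution package: {filename}")
    # Both wheels and sdists start with "<name>-<version>"
    name, version = stem.split("-")[:2]
    return name, version


def group_by_release(filenames):
    """Map each (name, version) release to its distribution package filenames."""
    releases = defaultdict(list)
    for filename in filenames:
        releases[release_key(filename)].append(filename)
    return dict(releases)


FILENAMES = [
    "scipy-0.9.0-cp26-cp26mu-manylinux1_x86_64.whl",
    "scipy-0.9.0-cp27-cp27m-manylinux1_x86_64.whl",
    "scipy-0.9.0-cp27-cp27mu-manylinux1_x86_64.whl",
    "scipy-0.9.0.tar.gz",
    "scipy-0.9.0.zip",
]

releases = group_by_release(FILENAMES)
```

Here five content units collapse into a single key, `("scipy", "0.9.0")`, which is the view a user browsing a repository would most likely want.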

Metadata that belongs to a release (i.e. additional metadata) would be repeated across content units. PyPI stores these metadata fields as a part of the release [0], and these fields could be updated in PyPI outside of a release. The metadata we store would be the metadata in a distribution package, which is immutable, so if a user updates metadata in PyPI, we would not sync the metadata updates.

h3. Python Release as a Content Unit

The alternative is to model a python release as a content unit. A content unit would look like:
|name |
|version|
|metadata_version|
|additional metadata |

Where the primary key would be (name, version)

And the Distribution Package would map to an artifact, and it would have the following fields:
|packagetype |
|path |
|filename |

Each content unit would contain all the distribution packages associated with it.
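
A minimal sketch of this alternative, again with plain dataclasses rather than Django models. The class names, the `key` property, and the `additional_metadata` dict are assumptions made for illustration, as are the path values.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DistributionPackage:
    """Hypothetical artifact record: one file belonging to a release."""

    packagetype: str
    path: str
    filename: str


@dataclass(frozen=True)
class Release:
    """Hypothetical content unit keyed on (name, version)."""

    name: str
    version: str
    metadata_version: str
    additional_metadata: dict = field(default_factory=dict)
    artifacts: tuple = ()  # tuple of DistributionPackage

    @property
    def key(self):
        # The primary key proposed in the text
        return (self.name, self.version)


release = Release(
    name="scipy",
    version="0.9.0",
    metadata_version="1.1",
    artifacts=(
        DistributionPackage("sdist", "scipy/scipy-0.9.0.tar.gz", "scipy-0.9.0.tar.gz"),
        DistributionPackage("sdist", "scipy/scipy-0.9.0.zip", "scipy-0.9.0.zip"),
    ),
)
```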

h4. Disadvantages

This way of modeling works really well for the simple use case of "I just want to sync everything", but it begins to break down when we consider the "I only want the wheels" use case.

If I want one repository to contain all the sdists and another to contain all the wheels, this implementation as is would not allow that (since content units are immutable, we can't have one sdist-only scipy v0.9.0 content unit and another wheels-only one). @asmacdo has proposed a partial sync workaround, which he will update this issue with.
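
The conflict can be shown with a small sketch. The `Release` class below is a hypothetical stand-in that tracks only filenames; splitting one release into a wheels-only and an sdist-only unit yields two different units that claim the same (name, version) key.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Release:
    """Hypothetical release-level content unit, reduced to its filenames."""

    name: str
    version: str
    filenames: tuple

    @property
    def key(self):
        return (self.name, self.version)


full = Release(
    "scipy",
    "0.9.0",
    (
        "scipy-0.9.0-cp27-cp27m-manylinux1_x86_64.whl",
        "scipy-0.9.0.tar.gz",
        "scipy-0.9.0.zip",
    ),
)

# Units are immutable, so each split has to be a new unit
wheels_only = replace(
    full, filenames=tuple(f for f in full.filenames if f.endswith(".whl"))
)
sdists_only = replace(
    full, filenames=tuple(f for f in full.filenames if not f.endswith(".whl"))
)

# Same primary key, different contents
same_key = wheels_only.key == sdists_only.key == full.key
different_units = wheels_only != sdists_only
```

Both `same_key` and `different_units` are true here, which is exactly the situation a (name, version) primary key cannot represent.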

I will note, however, that even with partial sync we will have to display the metadata for all the artifacts; otherwise a pulp <-> pulp sync will leave out known artifacts.

h3. Glossary

Release
A snapshot of a Project at a particular point in time, denoted by a version identifier.
Making a release may entail the publishing of multiple Distributions. For example, if version 1.0 of a project was released, it could be available in both a source distribution format and a Windows installer distribution format.

Distribution Package
A versioned archive file that contains Python packages, modules, and other resource files that are used to distribute a Release. The archive file is what an end-user will download from the internet and install. A project may contain many releases, and releases may contain many distribution packages. Can be of type sdist, bdist, etc. "Distribution package" is used instead of "package" to avoid confusion with "import packages" or Linux "distributions".

[0] https://warehouse.pypa.io/api-reference/xml-rpc/
