Task #1940

Determine if package metadata should be updated after download

Added by mhrivnak over 5 years ago. Updated almost 3 years ago.

Status: CLOSED - NOTABUG
Priority: Normal
Sprint/Milestone: -
Start date:
Due date:
% Done: 0%
Estimated time:
Platform Release:
Target Release - Python:
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint: Sprint 3
Quarter:

Description

Credit to @amacdona, who actually wrote this in email. I'm just moving it to redmine.

I (@amacdona) have discussed this issue with a few of you, but I want to open it up
to the entire team.

The Python plugin used to crack open the metadata files and use the
information there to populate the unit's metadata. For the upload case,
we should be able to generate all the necessary information from the
packages themselves using the twine library.
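
To make "generate the information from the packages themselves" concrete, here is a minimal sketch that reads the PKG-INFO headers out of a .tar.gz sdist. It uses only the standard library as a stand-in and is not twine's API. The function names and the demo archive are hypothetical, and the sketch assumes PKG-INFO sits directly under the archive's top-level directory.

```python
import io
import tarfile
from email.parser import Parser


def read_sdist_metadata(fileobj):
    """Return the PKG-INFO headers of a .tar.gz sdist as a dict.

    Assumption: PKG-INFO lives directly under the top-level directory,
    for example "demo-1.0/PKG-INFO". Repeated headers (such as
    Classifier) collapse to their last value in this simple version.
    """
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            parts = member.name.split("/")
            if len(parts) == 2 and parts[1] == "PKG-INFO":
                raw = archive.extractfile(member).read().decode("utf-8")
                # PKG-INFO uses RFC 822 style "Key: value" headers.
                return dict(Parser().parsestr(raw).items())
    raise ValueError("no PKG-INFO found in archive")


def build_demo_sdist():
    """Build a tiny in-memory sdist so the sketch can run without PyPI."""
    payload = b"Metadata-Version: 1.1\nName: demo\nVersion: 1.0\nLicense: MIT\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        entry = tarfile.TarInfo("demo-1.0/PKG-INFO")
        entry.size = len(payload)
        archive.addfile(entry, io.BytesIO(payload))
    buffer.seek(0)
    return buffer
```

Calling `read_sdist_metadata(build_demo_sdist())` returns the header fields of that one file, which is the per-release snapshot that the upload path would have to rely on.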

Since we are moving towards the ability to lazily sync Python
repositories, there is a requirement that we can generate units using
only metadata that is available before downloading the bits.

Option 1. Continue to use only the metadata from PyPI. This leads
nicely into the lazy work, and the model will continue to be minimal.
Option 2. Create minimal units from PyPI and, when the packages are
downloaded, inspect them (using the same twine library that will be used
in upload) and use that metadata to populate a more informative model.
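
The difference between the two options can be sketched as follows. `PackageUnit`, `create_from_index`, and `enrich_after_download` are hypothetical names for illustration only and do not reflect the plugin's real model or code.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PackageUnit:
    # Hypothetical minimal model, not the plugin's actual unit class.
    name: str
    version: str
    filename: str
    extra: Dict[str, str] = field(default_factory=dict)
    downloaded: bool = False


def create_from_index(name, version, filename):
    """Option 1, and the first step of Option 2: use index data only."""
    return PackageUnit(name=name, version=version, filename=filename)


def enrich_after_download(unit, package_metadata):
    """Option 2 only: merge metadata read from the downloaded file."""
    unit.extra.update(package_metadata)
    unit.downloaded = True
    return unit
```

Under Option 1 a unit never changes after `create_from_index`. Under Option 2 the same unit gains fields once its bits arrive, so its contents depend on download state.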

It is necessary to give a little background on why this choice
matters. The PyPI model is structured differently from ours, and
because of this, some of the information on each package is lost when
packages are grouped into projects. I have a more detailed explanation
in our python model docs. [0] The point of all of this is that an older
release may have different metadata than a new release, but this
information is not accessible through the PyPI API; it is only
accessible by inspecting the files.
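
A small sketch of how that loss shows up when units are built from project-level data. The sample below is hand-written and only assumes a response with one project-level `info` mapping and a `releases` mapping from version to file entries. The field values and the `units_from_project` helper are hypothetical.

```python
# Hand-written sample, not real PyPI output. Assumption: metadata such as
# the license appears once per project, not once per release.
SAMPLE_PROJECT = {
    "info": {"name": "demo", "version": "2.0", "license": "Apache-2.0"},
    "releases": {
        "1.0": [{"filename": "demo-1.0.tar.gz"}],
        "2.0": [{"filename": "demo-2.0.tar.gz"}],
    },
}


def units_from_project(project):
    """Build one unit dict per released file using project-level metadata."""
    info = project["info"]
    units = []
    for version, files in sorted(project["releases"].items()):
        for file_entry in files:
            units.append({
                "name": info["name"],
                "version": version,
                "filename": file_entry["filename"],
                # The only license available is the project-level one, so
                # every release gets it, whatever the old file declared.
                "license": info["license"],
            })
    return units
```

If release 1.0 was actually published under a different license, the unit built here would still report the project-level value. Only the file itself carries the original.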

Benefits of Option 1:
1. Faster. We no longer need to touch the files.
2. Metadata is consistent with PyPI's API.
3. Packages are consistent from the time they are created.

Benefits of Option 2:
1. We can include more metadata in the model.
2. The metadata of the package is consistent with the snapshot of
metadata at the time of package release.
3. Metadata will be consistent with the same package if it were
uploaded rather than synced.

I am leaning toward Option 1, but I would like to hear everyone's
feedback first.

[0] https://github.com/asmacdo/pulp_python/blob/08f8f76656de89818fc7429b2c022f4634eaea77/plugins/pulp_python/plugins/models.py#L24

History

#1 Updated by mhrivnak over 5 years ago

Quick background: In general, I think Pulp should avoid the temptation to store more metadata on units than users will actually find valuable. We took the "kitchen sink" approach with RPMs, and I think that has not paid off. Figuring out which pieces of metadata are useful is subjective, but I would err on the side of starting from a minimal model, and then come up with a justification for each additional field you want to add.

Given that, I do not think Option 2 Benefit 1 is necessarily compelling.

I also lean in favor of focusing on the API's metadata, and for uploaded packages, figure out how to get as close to that same data as we can.

It sounds like option 2 might depend on downloading the file? How would that work for doing a publish of non-downloaded units?

#2 Updated by amacdona@redhat.com over 5 years ago

> Given that, I do not think Option 2 Benefit 1 is necessarily compelling.

Agreed.

> It sounds like option 2 might depend on downloading the file? How would that work for doing a publish of non-downloaded units?

For non-downloaded units, we just publish what we have, which will be complete enough for functionality but will be missing the extra metadata.

> I also lean in favor of focusing on the API's metadata, and for uploaded packages, figure out how to get as close to that same data as we can.

We will have inconsistency one way or the other. Either uploaded packages can be slightly different from their API metadata counterparts (I think this is not that bad), or data within Pulp itself will change depending on whether the bits have been downloaded yet (this could be confusing).

#3 Updated by rbarlow over 5 years ago

Be careful of legal information, since the license is part of the
package metadata. With option 1, packages that change licenses will be
misreported. jcline and I have witnessed quite a few license changes
during our Erlang packaging too (and one of my Fedora Python packages
changed), so this is not only a possibility but something I'd say is
guaranteed to happen.

It's important for Pulp to track correct metadata for each version of a
package and for it not to misinform users by putting the newest metadata
on old packages.

My suggestion is to leave the data null until the package has been
fetched during a lazy sync, or to simply not include any metadata that
could change between versions.
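
The "leave it null until fetched" idea could look roughly like this. The `VERSION_SENSITIVE` set and the `fields_for_unit` helper are hypothetical, and which fields really vary between releases is an assumption made for the example.

```python
# Hypothetical list of fields that may differ from one release to the next.
VERSION_SENSITIVE = frozenset({"license", "summary", "author"})


def fields_for_unit(index_info, downloaded_metadata=None):
    """Build unit fields without copying version-sensitive index values.

    Version-sensitive fields stay None until metadata read from the
    downloaded file is supplied.
    """
    unit_fields = {}
    for key, value in index_info.items():
        unit_fields[key] = None if key in VERSION_SENSITIVE else value
    if downloaded_metadata is not None:
        for key in VERSION_SENSITIVE:
            if key in downloaded_metadata:
                unit_fields[key] = downloaded_metadata[key]
    return unit_fields
```

Dropping the version-sensitive keys from the model entirely, the second suggestion, would amount to skipping those keys instead of storing None.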

#4 Updated by amacdona@redhat.com over 5 years ago

@rbarlow - Yes, I agree that could be a problem. This is why (with option 1 in mind) I have simply removed the license field from our model. I would be comfortable adding that field back in only if we go with Option 2.

#5 Updated by amacdona@redhat.com over 5 years ago

  • Status changed from ASSIGNED to CLOSED - NOTABUG
  • Sprint/Milestone set to 21

I have opted for Option 1: packages are created using only the PyPI metadata and are not updated after download.

#6 Updated by bmbouter almost 4 years ago

  • Sprint set to Sprint 3

#7 Updated by bmbouter almost 4 years ago

  • Sprint/Milestone deleted (21)

#8 Updated by bmbouter almost 3 years ago

  • Tags Pulp 2 added
