Story #4690

Updated by dalley over 1 year ago

Currently our "includes" field looks like this:

{ "name": "django",
"version_specifier": "~=2.0,!=2.0.1",
{"type": "sha256",
"digest": "3d9916515599f757043c690ae2b5ea28666afa09779636351da505396cbb2f19"}
{"name": "pip-tools",
"version_specifier": ">=1.12,<=2.0"},
{"name": "scipy",
{"type": "md5",
"digest": "044af71389ac2ad3d3ece24d0baf4c07"},
{"type": "sha256",
"digest": "18b572502ce0b17e3b4bfe50dcaea414a98290358a2fa080c36066ba0651ec14"}]
{"name": "shelf-reader"}

These "digests" are a bit of an ergonomic problem, and the tooling to make them useful as something the user manually inputs doesn't exist. Further, the way Pulp works is inconsistent with the tooling that *does* exist.

*Ergonomic problem:*

While the requirements.txt format does support specifying hashes, there's no easy way to get them, since no tools give you a way to output them. Pip freeze unfortunately doesn't support it, so you have to either parse the API output, or hash the files yourself, or use a tool like pipenv or poetry to get them. So while we might eventually want to support this, I don't think it makes any sense to expose it to the user manually. It can be a feature that only gets used when ingesting a pipenv.lock or similar kind of file.

*Existing behavior is inconsistent with how other tooling treats them*

For all the tools which do support digests, they are only defined for pinned versions. E.g. it is completely invalid to do this:

shelf-reader>=0.1 --hash=sha256:04cfd8bb4f843e35d51bfdef2035109bdea831b55a57c3e6a154d14be116398c

It will fail with the following error:

In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
shelf-reader>=0.1 from (from -r requirements.txt (line 1))

We currently allow adding digests to *any specifier,* which is inconsistent with the broader ecosystem. No tooling will output a specifier like this, and I don't see a reason for a user to do so manually.

Similarly, pip will not allow you to specify hashes for some packages and not for others, *including* for their dependencies. In other words, it is invalid to do something like:

Django==2.1 --hash=sha256:04cfd8bb4f843e35d51bfdef2035109bdea831b55a57c3e6a154d14be116398c

This is invalid because, as the Pip docs describe, enabling hash-checking mode is an all-or-nothing proposition. Django pulls in dependencies such as pytz, and because those hashes are not also defined, installing that requirements file will fail. In this sense, Pulp is also acting inconsistently with the broader ecosystem by allowing digests to be specified for some packages but not for others.

This makes complete sense when you think about it. We don't want to provide anything less than a full guarantee to users who want the reproducability use case. There's little use in specifying one single package by digest when all of it's dependencies could be different. The all-or-nothing nature of the tooling likely better matches the desires of those users moreso than what Pulp is currently doing.


I think that the way this feature is currently implemented is not going to work out long-term, and I also feel like it's probably not something we should have tried to implement as early as we did. I think we should remove "digests" as a user-facing concept for the moment, and _consider_ and deleting the sync code that deals with it, to bring it back later. Because there would be no way to test if it's getting broken by further changes. This one change would probably remove half the lines in our sync implementation, which I think is a virtue at this point in development.

Long term, here is how I think we should implement it:

# Either make a self-contained object that can be created from pipenv.lock, or something similar, which contains a full listing of all packages, their versions, and their digests being used, and allow it to be linked to the remote and supplement or take the place of the remote's own include and exclude settings when being used, settings, or

# Make a separate field called "locked_versions", that would supplement or take the place of the includes or excludes fields when being used

Both of these designs would:

* Simplify the sync codepaths *dramatically*
* Enforce the fact that digests only make sense in the context of a single release
* Provide more guarantees to the repo owner that when they are using digests for reproducability, they have a solid guarantee that the entire contents of the repository is the reproducable set

They would also:

* Make it possible to simplify the format of "includes" and "excludes".
* We could let them accept an array of requirements.txt-style strings instead of dictionaries. e.g. includes:='["django>=1,<=2.1,==2.3", "flask==2.0.1", "bottle~=3.2"]'
* This might be a little harder to validate so maybe we wouldn't do this, but it's possible and I think it would make Pulp feel a lot more familiar
* Open up the possibility for adding various strict modes
* maybe in strict mode, you're only allowed to have one release per project (with or without providing digests)
* maybe in strict mode, you're *only* allowed to have locked versions on the remote (no using includes/excludes), and providing hashes is enforced for all locked versions, similar to how pip behaves in hash-checking mode