Story #3167: Eliminate the need for Crane's in-memory database of images - Docker Support


Added by mihai.ibanescu@gmail.com over 6 years ago. Updated about 5 years ago.

Status: CLOSED - WONTFIX

Priority: Normal

Tags: Pulp 2

Description

I believe that, at least for Docker v2, careful layout of the json files generated by pulp_docker will render obsolete the in-memory database that Crane has to generate by polling the filesystem and loading .json files.

I have not looked into v1, but my understanding is it won't be supported anymore.

Details

The V2 REST API is documented here:

https://docs.docker.com/registry/spec/api/#detail

In that document, <name> refers to a repository+image name. In pulp_docker, this represents the repo-registry-id setting on the distributor's config, and if unset, it defaults to <pulp_repo_id>. In Crane's v2 view, this is referred to as name_component.

Right now, v2 json files are produced under /var/lib/pulp/published/docker/v2/app/ (assuming data_dir is /var/lib/pulp/published/docker/ in /etc/crane.conf). In that directory there is one json file per Pulp repository, and it is named <pulp_repo_id>.json.

If we went to a (potentially) deeper directory structure like <repo-registry-id>.json, where the id may contain slashes (for example mibanescu/lamp.json), then Crane could just try to find the redirect file at <name>.json after it splits out the <name> portion from the request URL. This could be performed in repository.get_schema2_data_for_repo(name_component), which is called from crane/views/v2.py:name_redirect.

Example

1) Create a pulp repository with id my-lamp and repo-registry-id=mibanescu/lamp; upload a v2 image and tag it as latest.
2) Publishing the pulp repository creates the redirect file at /var/lib/pulp/published/docker/v2/app/mibanescu/lamp.json.
3) Crane is set up with data_dir=/var/lib/pulp/published/docker in /etc/crane.conf.
4) Crane receives a request from the docker client: GET https://registry.example.com/v2/mibanescu/lamp/manifests/latest
5) Crane extracts name_component=mibanescu/lamp.
6) Crane looks for a file named <name_component>.json under data_dir, which expands to /var/lib/pulp/published/docker/v2/app/mibanescu/lamp.json, without the need to have a database, in memory or otherwise, to tell it the repo has been published.
7) Crane reads the url in the json file and issues the redirect, just like it currently does.
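
The lookup in this example can be sketched roughly as follows. This is only an illustration of the idea, not Crane's actual code: the function names, the `app_subdir` parameter, and the path-escape check are assumptions added for the sketch, and the parsed JSON is returned as-is because the file schema is not described here.

```python
import json
import os


def find_redirect_file(data_dir, name_component, app_subdir=os.path.join("v2", "app")):
    """Map a v2 <name> to the path where its redirect file would be published.

    Hypothetical helper: returns the path if the file exists, else None.
    Names that would resolve outside the app directory are rejected.
    """
    base = os.path.realpath(os.path.join(data_dir, app_subdir))
    candidate = os.path.realpath(os.path.join(base, name_component + ".json"))
    # Guard against names such as "../../etc/passwd" escaping the app directory.
    if os.path.commonpath([base, candidate]) != base:
        return None
    return candidate if os.path.isfile(candidate) else None


def load_redirect_data(data_dir, name_component):
    """Return the parsed redirect file for name_component, or None if unpublished."""
    path = find_redirect_file(data_dir, name_component)
    if path is None:
        return None
    with open(path) as f:
        return json.load(f)
```

With data_dir=/var/lib/pulp/published/docker and name_component=mibanescu/lamp, the candidate path is the one shown in the example above; a None result would correspond to a repository that has not been published.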

Limitations

The search catalog cannot be generated without walking the filesystem.
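
For illustration, a catalog of published names could be rebuilt by walking the app directory, roughly like this. The function name is hypothetical, and the sketch assumes every .json file under the directory is a redirect file named after its <name>.

```python
import os


def list_published_names(app_dir):
    """Walk app_dir and return the sorted <name> values implied by its .json files."""
    suffix = ".json"
    names = []
    for root, _dirs, files in os.walk(app_dir):
        for filename in files:
            if not filename.endswith(suffix):
                continue
            rel_path = os.path.relpath(os.path.join(root, filename), app_dir)
            # Strip the extension and normalise separators to the URL form.
            names.append(rel_path[: -len(suffix)].replace(os.sep, "/"))
    return sorted(names)
```

This walk is the cost that per-request lookups avoid, so search would still need it (or some index produced at publish time).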


Updated by mihai.ibanescu@gmail.com over 6 years ago

Description updated (diff)


Updated by mihai.ibanescu@gmail.com over 6 years ago

Description updated (diff)


Updated by mihai.ibanescu@gmail.com over 6 years ago

Description updated (diff)


Updated by dkliban@redhat.com over 6 years ago

I believe the other piece of information stored in the json files is the hostname that should be used to form the redirect URL. How could we teach crane about that?


Updated by mihai.ibanescu@gmail.com over 6 years ago

Description updated (diff)


Updated by ipanova@redhat.com over 6 years ago

1) v1

Even though I really want to get rid of v1, we cannot do it in the pulp2 line. It will be dropped in pulp3, though.

2) It is not true that a repo can serve only v2 content or only v1 content; it can actually serve both. That is why we create one json file for v1 and one json file for v2.

3) In order to expand to where the json file is, we need to know its exact location. If we implement the suggested approach, things will stop working for people after an upgrade, because the json file could be located in the root directory or in the subdirectories. That's a breaking change.

I know it is annoying to walk through the nested directories, but we load only when something has changed in the data_dir.

So my question would be: which is more frequent, data changing on the registry, or docker pulls by clients?
How will you handle multiple requests from clients for the same file? If you have docker pull lamp from 1000 clients, you will access the fs 1000 times. Besides that, you can have docker pulls from other clients for other repos, and meanwhile also redirects to blobs and manifests.

We are weighing the number of requests to the fs (which is shared, can be overloaded, and has quirks with its cache and when it is flushed) against keeping some data in memory.


Updated by mihai.ibanescu@gmail.com over 6 years ago

Let's focus on whether this is even technically possible, before we argue pulp2 vs. pulp3.

If filesystem access is a concern (which to me is not), you can always build an in-memory cache. That is different from an in-memory database, without which Crane doesn't work at all.

My distinction between a database and a cache is:

without the former things don't work
without the latter things run slower

This ticket is about eliminating the database. That the database happens to be also acting as a cache is true, but not relevant to whether it's possible or not to get rid of it.
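
To make the distinction concrete, here is a minimal sketch of such a cache placed in front of per-request file reads. The class name and the choice to invalidate on modification time are assumptions for illustration; note that this variant still performs one stat call per request, so it reduces file reads and JSON parsing rather than all filesystem access.

```python
import json
import os


class RedirectFileCache:
    """Optional cache: lookups still work without it, only more slowly."""

    def __init__(self):
        self._entries = {}  # path -> (mtime_ns, parsed data)

    def get(self, path):
        try:
            mtime = os.stat(path).st_mtime_ns
        except FileNotFoundError:
            # The file was unpublished (or never existed); forget any stale entry.
            self._entries.pop(path, None)
            return None
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]
        # Missing or stale entry: re-read the file and remember it.
        with open(path) as f:
            data = json.load(f)
        self._entries[path] = (mtime, data)
        return data
```

Dropping the cache object and reading the file directly on every request would give the same answers, which is the property that separates a cache from the database described above.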


Updated by ipanova@redhat.com over 6 years ago

We discussed this directly with misa. Summary:

- This story could be accepted for pulp3. It cannot be accepted for pulp2 because it would be a breaking change for existing users. We give users the flexibility to place the app files in the root data_dir or in subdirectories, as desired; data_dir defaults to /var/lib/crane/metadata/. This prevents us from calculating the path as suggested in the story.
- We discovered that data_dir can be set to the exact path of the directory from which metadata will be loaded, as mentioned in the docs [0] [1]. Since just v2 will be served, this eliminates the need to recursively walk the directory. Misa is happy.
- For pulp3, we would need to evaluate which approach is better: loading all the files at once, or loading a file per request (where we'd need to check whether the file exists with the os module, or use try/except around the json.load attempt). Also, if we plan to make search work, we should figure out how to make this compatible.

[0] https://docs.pulpproject.org/plugins/crane/index.html#configuration
[1] https://docs.pulpproject.org/plugins/pulp_docker/user-guide/recipes.html#configuring-crane-with-pulp-docker
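
The two per-request options mentioned in the last point could look roughly like this. Both function names are hypothetical, and neither handles malformed JSON, which would raise in either variant.

```python
import json
import os


def load_with_exists_check(path):
    """Check with the os module first, then load."""
    if not os.path.isfile(path):
        return None
    # The file could still disappear between the check and the open.
    with open(path) as f:
        return json.load(f)


def load_with_try_except(path):
    """Attempt the load and treat a missing file as 'not published'."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

The try/except form does one less filesystem call per hit and has no window between the existence check and the open, which may matter if files are republished while requests are in flight.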


Updated by bmbouter about 5 years ago

Status changed from NEW to CLOSED - WONTFIX


Updated by bmbouter about 5 years ago

Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.


Updated by bmbouter about 5 years ago

Tags Pulp 2 added
