Story #892

Redesign of Uploads API

Added by bmbouter almost 9 years ago. Updated almost 4 years ago.

Status: CLOSED - WONTFIX
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
% Done: 0%
Estimated time:
Platform Release:
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

Motivation

Here are some proposed adjustments to make the upload API simpler. The current upload API is documented here: http://pulp.readthedocs.org/en/latest/dev-guide/integration/rest-api/content/upload.html

These are mostly small changes, but the design is adapted from the Dropbox API, whose designers have likely thought carefully about the right way to do uploads. This covers only the API; #923 tracks updating the CLI/bindings to match.

Proposed Usage

  1. Send a PUT request to /upload with the first chunk of the file without creating an upload request. An upload_id will be automatically created and returned.
  2. Repeatedly PUT subsequent chunks using the upload_id parameter to identify the upload in progress and an offset representing the number of bytes transferred so far. Both upload_id and offset are GET style parameters.
  3. After each chunk has been committed to disk, the server returns a new offset representing the total amount transferred.
  4. After the last chunk, POST to /import_upload to import the entire file into a repo (see the sketch below).
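
A minimal client sketch of this flow, assuming a requests-based Python client. The base URL, file name, and chunk size are illustrative; the parameter names follow the proposal above, and the JSON body on the final POST is just one way the POST parameters could be sent:

    import requests

    BASE = "https://pulp.example.com/pulp/api/v2"   # illustrative base URL
    CHUNK = 4 * 1024 * 1024                         # 4 MB chunks

    def upload_file(path):
        upload_id = None
        offset = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                # The first PUT has no parameters and creates the upload session;
                # later PUTs identify it with upload_id and offset.
                params = {} if upload_id is None else {"upload_id": upload_id, "offset": offset}
                resp = requests.put(BASE + "/upload", params=params, data=chunk)
                resp.raise_for_status()
                body = resp.json()
                upload_id = body["upload_id"]
                offset = body["offset"]             # new total number of bytes committed
        # After the last chunk, import the whole file (plugin-specific
        # parameters and repo targeting omitted in this sketch).
        requests.post(BASE + "/import_upload", json={"upload_id": upload_id}).raise_for_status()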

Chunks can be any size up to 150 MB. A typical chunk is 4 MB. Using large chunks will mean fewer calls to /upload and faster overall throughput. However, whenever a transfer is interrupted, you will have to resume at the beginning of the last chunk, so it is often safer to use smaller chunks.

If the offset you submit does not match the expected offset on the server, the server will ignore the request and respond with a 400 error that includes the current offset. To resume the upload, seek to the correct offset (in bytes) within the file and then continue uploading from that point. This allows the client to be stateless: it can attempt to resume an upload by upload_id from the beginning and rely on the server to tell it the correct offset to resume from.
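
A sketch of that resume behavior, assuming the 400 response body carries the server's current offset in an "offset" field (the exact error format is not specified above):

    import requests

    def put_chunk(base, upload_id, offset, chunk):
        resp = requests.put(base + "/upload",
                            params={"upload_id": upload_id, "offset": offset},
                            data=chunk)
        if resp.status_code == 400:
            # Offset mismatch: seek to the returned offset in the file and retry.
            return resp.json()["offset"]
        resp.raise_for_status()
        return resp.json()["offset"]                # new total after this chunk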

Chunks support optional checksums via an additional GET style parameter named sha1sum, which is a sha1 checksum of the chunk computed by the client. Upon receiving a chunk that specifies sha1sum, the server verifies the checksum before committing the chunk to disk and returning 200 OK. If a chunk checksum fails to verify, the server responds with a 400 error indicating that the checksum failed to verify.
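
For example, the client-side value for that parameter could be computed with hashlib (a sketch; the sha1sum name is from this proposal):

    import hashlib

    def chunk_sha1(chunk: bytes) -> str:
        # Hex digest sent as the optional sha1sum GET style parameter.
        return hashlib.sha1(chunk).hexdigest()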

The /import_upload API call also supports optional checksum verification at the file level, using sha1sum as a POST parameter. If sha1sum is specified, the server verifies the checksum of the file before proceeding with the import. If the checksum fails to verify, the server responds with a 400 error indicating that the checksum failed to verify.

A chunked upload can take a maximum of 48 hours before expiring. This will be configurable in server.conf somewhere.

Differences from today

  • You can start uploading and an upload session is created in case you need chunking, but you don't have to use chunking if you don't actually need it. If you do need chunking, you repeat the same operation, only with upload_id and offset as GET style params to the same URL. We save a URL by not needing a separate endpoint for creating an upload request distinct from the one where the content is uploaded.
  • Pulp won't have a DELETE API endpoint anymore. Instead, Pulp would auto-clean with a reaper task that uses timestamps to clean up uploads after the expiration time.
  • Pulp won't support listing uploads anymore. It's not that useful, especially since a new upload can be started and the old one will be auto-cleaned up.
  • Checksums at the chunk level are also a new feature, which should be useful for large uploads like ISOs.
  • This implementation allows a single file to be uploaded once and imported into multiple repos without re-uploading. The current design allows for this, but the implementation does not, because several upload importers move uploaded files, which prevents a separate call to the current import API. This implementation should leave files in place during all calls to /import_upload and let the auto-delete handle any cleanup later.

API for /upload

Method: PUT

GET style Parameters:

upload_id -- The unique ID of the in-progress upload on the server. If left blank, the server will create a new upload session.

offset -- The byte offset of this chunk, relative to the beginning of the full file. The server will verify that this matches the offset it expects. If it does not, the server will return an error with the expected offset.

sha1sum -- (optional) The sha1 checksum for the chunk. The server will verify the chunk against this checksum. If it does not match, the server will return an error indicating that the chunk failed to verify.

The body is reserved for the binary upload content, so POST style params are not supported.
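
An illustrative request using Python's requests library (the host, file name, and byte values are examples only; 27241 plus a 4096-byte chunk yields the offset shown in the response below):

    import requests

    # PUT a subsequent chunk of an in-progress upload; the binary chunk is the
    # request body, and upload_id/offset are GET style parameters.
    with open("big.iso", "rb") as f:
        f.seek(27241)                 # offset of this chunk within the file
        chunk = f.read(4096)

    requests.put("https://pulp.example.com/pulp/api/v2/upload",
                 params={"upload_id": "16fd2706-8baf-433b-82eb-8c7fada847da",
                         "offset": 27241},
                 data=chunk)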

Example Response:

{
    "upload_id": "16fd2706-8baf-433b-82eb-8c7fada847da",
    "offset": 31337,
    "expires": "Tue, 19 Jul 2011 21:55:38 +0000"
}

API for /import_upload

Method: POST

POST parameters are divided into two types: platform parameters and plugin-specific parameters.

Platform POST style Parameters:

upload_id (string) - identifies the upload request being imported
sha1sum (string) -- (optional) The sha1 checksum for the file. The server will verify the file against this checksum. If it does not match, the server will return an error indicating that the file failed to verify.

Plugin POST style parameters:

These are optional because, by definition, they aren't required by all plugins. The following are examples; each plugin will document and specify its own set of parameters.

unit_type_id (string) - identifies the type of unit the upload represents
unit_key (object) - unique identifier for the new unit; the contents are contingent on the type of unit being uploaded
unit_metadata (object) - (optional) extra metadata describing the unit; the contents will vary based on the importer handling the import
override_config (object) - (optional) importer configuration values that override the importer’s default configuration

A 202 (Accepted) or an error will be returned, since the import happens asynchronously. Importing will leave the uploaded file in place in case the user wants to import it again into other repos using the upload interface. If not, auto cleanup will take care of the vestigial upload_id.
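
An illustrative import request; the importer, unit_key contents, and file name are examples only, and whether the parameters travel as form fields or a JSON body is not pinned down by this proposal:

    import hashlib
    import requests

    with open("foo-1.0-1.noarch.rpm", "rb") as f:
        file_sha1 = hashlib.sha1(f.read()).hexdigest()

    resp = requests.post("https://pulp.example.com/pulp/api/v2/import_upload",
                         json={"upload_id": "16fd2706-8baf-433b-82eb-8c7fada847da",
                               "sha1sum": file_sha1,        # optional file-level checksum
                               "unit_type_id": "rpm",       # plugin-specific example
                               "unit_key": {"name": "foo", "version": "1.0"}})
    assert resp.status_code == 202    # accepted; import runs asynchronously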


Related issues

Blocks Pulp - Story #894: Consolidate upload functionality of CLI and bindings (CLOSED - WONTFIX)
Blocks Pulp - Task #923: Update CLI and bindings to use new-style upload API in story #892 (CLOSED - WONTFIX)
Blocks Pulp - Issue #636: API delete nonexisting upload_id returns 200 (CLOSED - WONTFIX)
#1

Updated by bmbouter almost 9 years ago

  • Description updated (diff)
  • Category set to 14
#2

Updated by bmbouter almost 9 years ago

  • Blocks Story #894: Consolidate upload functionality of CLI and bindings added
#3

Updated by bmbouter almost 9 years ago

  • Description updated (diff)
#4

Updated by bmbouter almost 9 years ago

  • Blocks Task #923: Update CLI and bindings to use new-style upload API in story #892 added
#5

Updated by bmbouter almost 9 years ago

  • Description updated (diff)
#6

Updated by ipanova@redhat.com over 8 years ago

  • Blocks Issue #636: API delete nonexisting upload_id returns 200 added
#7

Updated by mihai.ibanescu@gmail.com over 8 years ago

For uploading units, have you considered using Content-Range headers, as defined in http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.16 ?

If the client cannot upload a unit in a single chunk, an alternative way could be:

  • The first request the client sends includes a Content-Range header with the full length specified. Assuming a unit 256M long, the first request uploading a 1M chunk could include:

Content-Range: bytes 0-1048575/268435456

The second could do:

Content-Range: bytes 1048576-2097151/268435456

and so on.
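
A sketch of how a client could build those headers for each chunk; the endpoint URL is hypothetical, and the header format follows RFC 2616 section 14.16 as referenced above:

    import os
    import requests

    def upload_with_content_range(path, url, chunk_size=1024 * 1024):
        total = os.path.getsize(path)
        with open(path, "rb") as f:
            offset = 0
            while offset < total:
                chunk = f.read(chunk_size)
                end = offset + len(chunk) - 1
                headers = {"Content-Range": "bytes %d-%d/%d" % (offset, end, total)}
                requests.put(url, headers=headers, data=chunk).raise_for_status()
                offset += len(chunk)

Because each chunk carries its own absolute range, chunks could in principle also be sent in parallel rather than sequentially, as noted in the advantages below.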

Several advantages:

  • You can theoretically upload the chunks in parallel; you don't have to require chunks to be uploaded sequentially.
  • You can do some level of checking that subsequent chunks specify the same length as the first one. If on the server side you choose to implement unit uploads with sparse files, as soon as you see the first request you can create the sparse file, and future chunk uploads can compare against the size of the file.
  • You can require the Content-MD5 HTTP header to provide checksumming for each chunk.
  • If you deal with sparse files (because ISOs can be sparse), the client can choose to skip over the zeros and not upload them at all.

The disadvantage is that you don't quite know when the client is done uploading; you'd need some state. You can probably add an extra parameter for a sha256sum of the whole unit, and at unit association time you can verify if the representation on disk does indeed match. But that can be fixed in other ways.

Other random thoughts, for those who are REST purists:

  1. random arguments embedded in URLs are generally frowned upon. They tend to break caching, since proxies cannot determine if they're the same resource. They also hide what the implementation was supposed to do.
  2. PUT requests should be idempotent. If I send the same PUT request twice (or more), I expect to have the same result as doing it just once. It is not clear to me from step 1 of your "Proposed Usage" whether that would be the case, but to me that should be a POST request, not a PUT.
  3. you're not really PUTting a unit resource; that would be dangerous, as it would allow one to mutate an existing unit. From a logical perspective, the unit repository only supports CREATE and DELETE, and no UPDATE. So you're still dealing with an upload session of sorts. As you described above, an upload session may expire, but it would be nice if, as a client, I could remove a session.
  4. it would be nice if the API would tell me whether the unit I am about to import already exists, and where that unit is.

So, the full workflow might be:
POST /uploads
Request body:

  • sha256sum
  • size

Response body:

  • if a unit with that sha256sum already exists, a 303 (See Other) status code could be returned, with the URI to the unit specified in the Location header. No additional resource will be created.
  • if the unit does not exist, a 201 (Created) is returned, with the json body:
    • content: Link URL to /uploads/<uuid>/content; methods allowed: PUT
    • status: Link URL to /uploads/<uuid>/status; methods allowed: GET, PUT

PUT /uploads/<uuid>/content
Request headers: Content-Range as suggested above, if chunked
Request body: full unit or chunk
Response: 200/201 (or error)
Response body: empty if successful

PUT /uploads/<uuid>/status
Request body: "finished"
Response: 200/201 (or error)

When the client knows that all chunks have been uploaded, it will run the PUT operation on the status resource, which will copy the unit to its final location after verifying the checksum.
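
A client-side sketch of that workflow, with hypothetical URLs and field names matching the outline above (the content/status links are assumed to come back as absolute URLs):

    import hashlib
    import requests

    BASE = "https://pulp.example.com/pulp/api/v2"       # illustrative

    def upload_unit(path):
        with open(path, "rb") as f:
            data = f.read()
        meta = {"sha256sum": hashlib.sha256(data).hexdigest(), "size": len(data)}
        resp = requests.post(BASE + "/uploads", json=meta, allow_redirects=False)
        if resp.status_code == 303:
            return resp.headers["Location"]              # unit already exists
        resp.raise_for_status()                          # expect 201 Created
        links = resp.json()
        requests.put(links["content"], data=data).raise_for_status()
        requests.put(links["status"], json="finished").raise_for_status()
        return links["content"]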

I am running out of time, and I haven't thought this through, but please let me know if these bring up any new ideas that might help.

#8

Updated by dgregor@redhat.com over 8 years ago

When importing into multiple repos, how would the situation be handled where some imports succeed and others fail? For example, in our deployment we have a list of allowed RPM signatures specific to each repo. This prevents an RPM with a "beta" signature from ending up in a "gold" repo. If I upload a beta RPM and try to import it into a mix of beta and gold repos, some imports would succeed and others would fail.

#9

Updated by bmbouter almost 5 years ago

  • Status changed from NEW to CLOSED - WONTFIX
#10

Updated by bmbouter almost 5 years ago

Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.

#11

Updated by bmbouter almost 5 years ago

  • Tags Pulp 2 added
#12

Updated by bmbouter almost 4 years ago

  • Category deleted (14)

We are removing the 'API' category per open floor discussion June 16, 2020.
