Story #892

Updated by bmbouter over 8 years ago

h3. Motivation 

 Here are some proposed adjustments to make the upload API simpler. The current upload API is documented here: 

 These are mostly minor changes, but the design is adapted from the Dropbox API, whose authors have likely thought about the right way to do uploads. This story covers only the API; a separate story will be written to extend the CLI/bindings to match. 

 h3. Typical usage 

 # Send a PUT request to /upload with the first chunk of the file, without setting upload_id. An upload_id will be automatically created and returned. 
 # Repeatedly PUT subsequent chunks using the upload_id parameter to identify the upload in progress and an offset representing the number of bytes transferred so far. Both upload_id and offset are GET style parameters. 
 # After each chunk has been committed to disk, the server returns a new offset representing the total amount transferred. 
 # After the last chunk, POST to /import_upload to import the entire file into a repo. 
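The steps above could be sketched on the client side roughly as follows. This is only an illustration of the proposed flow, not an existing Pulp binding: @upload_in_chunks@ and the @send_put@ transport callable are hypothetical names, and the response is assumed to be a JSON object with @upload_id@ and @offset@ keys as in the example response in this story.

```python
from urllib.parse import urlencode


def upload_in_chunks(data, send_put, chunk_size=4 * 1024 * 1024):
    """Upload `data` (bytes) chunk by chunk and return the upload_id.

    `send_put(url, body)` is a hypothetical transport callable that performs
    the PUT and returns the decoded JSON response as a dict.
    """
    upload_id = None
    offset = 0
    # Always send at least one request so that an upload_id gets created.
    while offset < len(data) or upload_id is None:
        chunk = data[offset:offset + chunk_size]
        params = {}
        if upload_id is not None:
            # Subsequent chunks identify the upload and the bytes sent so far.
            params = {"upload_id": upload_id, "offset": offset}
        url = "/upload" + ("?" + urlencode(params) if params else "")
        response = send_put(url, chunk)
        upload_id = response["upload_id"]
        offset = response["offset"]  # total bytes the server has committed
    return upload_id
```

The loop trusts the offset reported by the server instead of counting bytes locally, which is what lets the client stay stateless.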

 Chunks can be any size up to 150 MB. A typical chunk is 4 MB. Using large chunks will mean fewer calls to /upload and faster overall throughput. However, whenever a transfer is interrupted, you will have to resume at the beginning of the last chunk, so it is often safer to use smaller chunks. 

 If the offset you submit does not match the expected offset on the server, the server will ignore the request and respond with a 400 error that includes the current offset. To resume upload, seek to the correct offset (in bytes) within the file and then resume uploading from that point. This allows the client to be stateless and attempt to resume uploads by upload_id from the beginning, and rely on the server to tell the client the correct offset to resume from. 
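A minimal sketch of the offset check and the client-side resume, under stated assumptions: the story only says the 400 response "includes the current offset", so the @{"offset": ...}@ error body used here is an assumed shape, and both function names are hypothetical.

```python
import io


def offset_error(expected_offset, submitted_offset):
    """Server side: return a (400, body) pair on mismatch, else None."""
    if submitted_offset != expected_offset:
        # Assumed error shape: the current offset under an "offset" key.
        return 400, {"offset": expected_offset}
    return None


def read_chunk_at(fileobj, offset, chunk_size):
    """Client side: seek to the server-reported offset and read one chunk."""
    fileobj.seek(offset)
    return fileobj.read(chunk_size)


# A stateless client guesses offset 0; the server already holds 8 bytes.
status, body = offset_error(expected_offset=8, submitted_offset=0)
next_chunk = read_chunk_at(io.BytesIO(b"0123456789abcdef"), body["offset"], 4)
```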

 Chunks support optional checksums using an additional GET style parameter named sha1sum, which is a sha1 checksum of the chunk computed by the client. Upon receiving a chunk that specifies sha1sum, the server verifies the checksum before committing the chunk to disk and returning the 200 OK. If a chunk checksum fails to verify, the server responds with a 400 error that indicates the checksum failed to verify. 
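For illustration, the per-chunk checksum could be computed and checked like this. The story does not say how sha1sum is encoded, so the lowercase hex digest is an assumption, and the helper names are hypothetical.

```python
import hashlib
from urllib.parse import urlencode


def chunk_sha1(chunk):
    """Return the sha1 of a chunk as a hex string (assumed encoding)."""
    return hashlib.sha1(chunk).hexdigest()


def chunk_query(upload_id, offset, chunk):
    """Client side: build the GET style parameters for one chunk PUT."""
    return urlencode(
        {"upload_id": upload_id, "offset": offset, "sha1sum": chunk_sha1(chunk)}
    )


def verify_chunk(chunk, sha1sum):
    """Server side: True if the chunk matches the client-supplied checksum."""
    return chunk_sha1(chunk) == sha1sum.lower()
```

A server following this proposal would answer with a 400 whenever @verify_chunk@ returns False, and would commit the chunk to disk only after the check passes.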

 The /import_upload API call also supports optional checksum verification at the file level, using sha1sum as a POST parameter. If sha1sum is specified, the server verifies the checksum of the file before proceeding with the import. If the checksum fails to verify, the server responds with a 400 error that indicates the checksum failed to verify. 
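Since uploads can be large, the file-level checksum is best computed in a streaming fashion on either side. A small sketch, assuming a hex-encoded digest; @file_sha1@ is a hypothetical helper name.

```python
import hashlib


def file_sha1(fileobj, block_size=1024 * 1024):
    """Return the hex sha1 of a binary file object, read block by block."""
    digest = hashlib.sha1()
    # iter() with a sentinel keeps reading until read() returns b"" at EOF.
    for block in iter(lambda: fileobj.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()
```

Usage would look like @file_sha1(open(path, "rb"))@, with the result passed as the sha1sum POST parameter.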

 A chunked upload can take a maximum of 48 hours before expiring. This will be configurable in server.conf, but what the setting should be called still needs input. 
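The expiration rule could be applied along these lines. This sketch assumes the 48 hours are counted from the creation of the upload and that the server keeps a creation timestamp per upload_id; neither detail is settled in this story, and @expired_upload_ids@ is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# Default from this story; the real value is meant to come from server.conf.
MAX_UPLOAD_AGE = timedelta(hours=48)


def expired_upload_ids(uploads, now, max_age=MAX_UPLOAD_AGE):
    """Return the sorted ids of uploads older than max_age.

    `uploads` maps upload_id -> creation time (assumed bookkeeping).
    """
    return sorted(
        upload_id
        for upload_id, started in uploads.items()
        if now - started > max_age
    )


now = datetime(2020, 1, 3, tzinfo=timezone.utc)
uploads = {
    "old": datetime(2020, 1, 1, tzinfo=timezone.utc) - timedelta(seconds=1),
    "new": datetime(2020, 1, 2, tzinfo=timezone.utc),
}
stale = expired_upload_ids(uploads, now)
```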

 h3. Differences from today 

 * You can start uploading right away, and an upload session is created in case you need chunking, but you don't have to chunk if you don't need to. If you do need chunking, repeat the same operation with upload_id and offset as GET style params to the same URL. This also saves a URL, because there is no separate endpoint for creating an upload request that differs from where the content is uploaded. 

 * Pulp won't have a DELETE API endpoint anymore. Instead, Pulp would clean up automatically with a reaper that uses timestamps to remove uploads after the expiration time. 

 * Pulp won't support the listing of uploads anymore. It's not that useful, especially since a new one can be started and the old one will be cleaned up automatically. 

 * Checksums at the chunk level are also a new feature, which should be useful for large uploads like ISOs. 

 * This implementation allows uploading a single file and importing it into multiple repos without re-uploading. The current design allows for this, but the implementation does not, because several upload importers move the uploaded file, which prevents a second call to the current import API. This implementation should leave files in place during all calls to /import_upload and let the auto-delete handle any cleanup later. 

 h2. API for /upload 

 Method: PUT 

 GET style Parameters: 
 upload_id -- The unique ID of the in-progress upload on the server. If left blank, the server will create a new upload session. 

 offset -- The byte offset of this chunk, relative to the beginning of the full file. The server will verify that this matches the offset it expects. If it does not, the server will return an error with the expected offset. 

 sha1sum -- (optional) The sha1 checksum for the chunk. The server will verify the chunk against this sha1sum. If it does not match, the server will return an error indicating the chunk failed to verify. 

 The body is reserved for the upload content binary data so POST style params are not supported. 

 Example Response: 
 { 
     "upload_id": "16fd2706-8baf-433b-82eb-8c7fada847da", 
     "offset": 31337, 
     "expires": "Tue, 19 Jul 2011 21:55:38 +0000" 
 } 

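A client could read such a response as sketched below. This assumes the response body is a JSON object and that @expires@ stays in the RFC 2822 style date format shown in the example; @parse_upload_response@ is a hypothetical helper.

```python
import json
from email.utils import parsedate_to_datetime


def parse_upload_response(text):
    """Return (upload_id, offset, expires) from a /upload response body."""
    payload = json.loads(text)
    # parsedate_to_datetime understands dates like "Tue, 19 Jul 2011 21:55:38 +0000".
    expires = parsedate_to_datetime(payload["expires"])
    return payload["upload_id"], payload["offset"], expires


example = """{
    "upload_id": "16fd2706-8baf-433b-82eb-8c7fada847da",
    "offset": 31337,
    "expires": "Tue, 19 Jul 2011 21:55:38 +0000"
}"""
upload_id, offset, expires = parse_upload_response(example)
```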
 h2. API for /import_upload 

 Method: POST 

 POST parameters are divided into two types: platform parameters and plugin-specific parameters. 

 Platform POST style Parameters: 

 upload_id (string) - identifies the upload request being imported 
 sha1sum (string) -- (optional) The sha1 checksum for the file. The server will verify the file against this sha1sum. If it does not match, the server will return an error indicating the file failed to verify. 

 Plugin POST style parameters: 

 These are optional because, by definition, not every plugin requires them. The following are examples; each plugin will document and specify its own set of parameters. 

 unit_type_id (string) - identifies the type of unit the upload represents 
 unit_key (object) - unique identifier for the new unit; the contents are contingent on the type of unit being uploaded 
 unit_metadata (object) - (optional) extra metadata describing the unit; the contents will vary based on the importer handling the import 
 override_config (object) - (optional) importer configuration values that override the importer’s default configuration 

 A 202 Accepted or an error will be returned, because the import runs asynchronously. Importing will leave the uploaded file in place in case the user wants to import it again into other repos using the upload interface. If not, auto cleanup will take care of the leftover upload.
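As a rough illustration, a client might assemble the /import_upload body like this. The story does not fix the encoding of POST parameters, so the JSON body is an assumption, @build_import_body@ is a hypothetical helper, and the @"rpm"@ value is only a placeholder for a plugin-defined type.

```python
import json


def build_import_body(upload_id, sha1sum=None, **plugin_params):
    """Return a JSON body combining platform and plugin parameters."""
    body = {"upload_id": upload_id}
    if sha1sum is not None:
        body["sha1sum"] = sha1sum  # optional file-level checksum
    # Plugin parameters (unit_type_id, unit_key, ...) pass through untouched.
    body.update(plugin_params)
    return json.dumps(body, sort_keys=True)


request_body = build_import_body(
    "16fd2706-8baf-433b-82eb-8c7fada847da",
    sha1sum="da39a3ee5e6b4b0d3255bfef95601890afd80709",
    unit_type_id="rpm",  # placeholder value
)
```

Because the upload stays in place after an import, the same upload_id could be sent in several such requests until the upload expires.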