Story #4488

As a user, I can upload chunks in parallel

Added by daviddavis 10 months ago. Updated 6 months ago.

Status: MODIFIED
Priority: Normal
Assignee:
Category: -
Sprint/Milestone:
Start date:
Due date:
% Done: 100%
Platform Release:
Blocks Release:
Backwards Incompatible: No
Groomed: Yes
Sprint Candidate: Yes
Tags: Katello-P1
QA Contact:
Complexity:
Smash Test:
Verified: No
Verification Required: No
Sprint: Sprint 55

Description

We're currently using drf-chunked-upload [0], but it seems the library has become unmaintained [1] since we adopted it. It also has some other quirks and missing features. So I think we should move off of it and roll our own code as part of this story.

Solution

Add a design which supports sha256 and parallel uploads of chunks.

Models

Upload

id = UUID
file = File
size = BigIntegerField
user = FK
created_at = DateTimeField
completed_at = DateTimeField

UploadChunk

id = UUID
upload = FK
offset = BigIntegerField
size = BigIntegerField
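
For concreteness, here is a minimal Django sketch of the two models above. The field options (the upload_to path, the related_name, the nullable completed_at) are illustrative assumptions, not settled design:

from django.conf import settings
from django.db import models
import uuid

class Upload(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    file = models.FileField(upload_to="upload/", max_length=255)
    size = models.BigIntegerField()
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    created_at = models.DateTimeField(auto_now_add=True)
    completed_at = models.DateTimeField(null=True)  # set by the commit call

class UploadChunk(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    upload = models.ForeignKey(Upload, related_name="chunks", on_delete=models.CASCADE)
    offset = models.BigIntegerField()  # byte position of this chunk within the file
    size = models.BigIntegerField()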

Workflow

# create the upload session
http POST :24817/pulp/api/v3/uploads/ size=10485760 # returns a UUID (e.g. 345b7d58-f1f8-45d9-d354-82a31eb879bf)
export UPLOAD='/pulp/api/v3/uploads/345b7d58-f1f8-45d9-d354-82a31eb879bf/'

# note the order doesn't matter here
http --form PUT :24817$UPLOAD file@./chunkab 'Content-Range:bytes 6291456-10485759/10485760'
http --form PUT :24817$UPLOAD file@./chunkaa 'Content-Range:bytes 0-6291455/10485760'

# view the upload and its chunks
http :24817${UPLOAD}

# complete the upload (sha256 must be the digest of the entire file)
http PUT :24817${UPLOAD}commit sha256=$(cat ./chunkaa ./chunkab | sha256sum | awk '{print $1}')

# create the artifact from the upload
http POST :24817/pulp/api/v3/artifacts/ upload=$UPLOAD
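
To make the "order doesn't matter" point concrete, here is a rough client-side sketch in Python that PUTs chunks in parallel with a thread pool. The base URL and chunk size mirror the example above; authentication is omitted and the function names are hypothetical:

import os
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "http://localhost:24817"
CHUNK = 6291456  # 6 MiB, matching the split size in the example

def send_chunk(upload_href, path, offset, total):
    # Read one chunk from disk and PUT it with its Content-Range.
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(CHUNK)
    end = offset + len(data) - 1
    headers = {"Content-Range": "bytes %d-%d/%d" % (offset, end, total)}
    response = requests.put(BASE + upload_href, headers=headers, files={"file": data})
    response.raise_for_status()

def upload_parallel(upload_href, path):
    # Fire off all chunks at once; the server accepts them in any order.
    total = os.path.getsize(path)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(send_chunk, upload_href, path, o, total)
                   for o in range(0, total, CHUNK)]
        for fut in futures:
            fut.result()  # re-raise any upload error

# usage: upload_parallel(UPLOAD, "./bigfile")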

Additional references

https://github.com/douglasmiranda/django-fine-uploader
https://medium.com/box-developer-blog/introducing-the-chunked-upload-api-f82c820ccfcb

[0] https://github.com/jkeifer/drf-chunked-upload
[1] https://github.com/jkeifer/drf-chunked-upload/pull/8


Related issues

Related to Pulp - Story #4196: As a user, I can upload files in chunks. MODIFIED Actions
Related to Pulp - Test #5263: Test - As a user, I can upload chunks in parallel NEW Actions
Blocks Pulp - Story #4988: As a user, I can remove uploads MODIFIED Actions

Associated revisions

Revision 24b50710 View on GitHub
Added by daviddavis 6 months ago

Add support for parallel chunks and sha256

Also removed drf-chunked-upload.

fixes #4488,#4486

History

#1 Updated by daviddavis 10 months ago

  • Related to Story #4196: As a user, I can upload files in chunks. added

#2 Updated by bmbouter 8 months ago

  • Tags deleted (Pulp 3)

#3 Updated by daviddavis 6 months ago

  • Description updated (diff)

#4 Updated by bmbouter 6 months ago

I really like this API. It's legit. I had a few questions I wanted to ask.

What if we didn't have the 'create the upload session' step at all? Couldn't the client generate a UUID and start using it?

How are chunks that were never part of an artifact removed?

Should we send a digest value for each chunk? If you have a large file, e.g. many gigs, one incorrect chunk would cause you to upload everything again.

#5 Updated by daviddavis 6 months ago

What if we didn't have the 'create the upload session' step at all? Couldn't the client generate a UUID and start using it?

I see a number of downsides to doing this. First, it's less RESTful. Second, we need to have the total file size before the upload to create the initial file. So we'd have to either pass in the TOTAL file size with the first request (may be hard with parallel uploads) or with every request (kind of awkward).

How are chunks that were never part of an artifact removed?

I am not totally sure what you're asking, but if it's how to remove incomplete uploads, drf-chunked-upload supports this (see https://github.com/jkeifer/drf-chunked-upload#settings) although we have yet to leverage that feature. This problem already exists today, though, and solving it is not needed for this story.

Should we send a digest value for each chunk? If you have a large file, e.g. many gigs, one incorrect chunk would cause you to upload everything again.

We could definitely add this but I think that's outside the scope of this story. Maybe file another story?

#6 Updated by dkliban@redhat.com 6 months ago

The user should have to start a session so Pulp has an opportunity to allocate space for the entire upload. Each uploaded chunk can then be written to its specific place in the file created at session creation. This avoids having to write out the whole file when the upload is complete.

Accepting checksums with each uploaded chunk would be helpful.
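
As a sketch of that idea (function names are hypothetical): preallocate the file when the session is created, then write each chunk at its offset:

def allocate(path, size):
    # At session creation: create a (sparse) file of the final size.
    with open(path, "wb") as f:
        f.truncate(size)

def write_chunk(path, offset, data):
    # Per chunk: write it into place; concurrent writers are fine as
    # long as their byte ranges do not overlap.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)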

#7 Updated by bmbouter 6 months ago

We can keep the session creation; it does allow you to make the large file and write into it. If you want to skip chunk checksums initially, that is OK too. We can also worry later about cleaning up uploads that never became Artifacts.

#9 Updated by daviddavis 6 months ago

  • Blocks Story #4988: As a user, I can remove uploads added

#10 Updated by ttereshc 6 months ago

  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes

#11 Updated by daviddavis 6 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis
  • Sprint set to Sprint 54
  • Tags Katello-P1 added

The changes to the API are blocking Katello, which is trying to integrate chunked uploads. Setting the P1 tag and adding to the sprint.

#12 Updated by daviddavis 6 months ago

  • Description updated (diff)

#13 Updated by daviddavis 6 months ago

  • Description updated (diff)

#14 Updated by ttereshc 6 months ago

  • Sprint changed from Sprint 54 to Sprint 55

#15 Updated by daviddavis 6 months ago

  • Status changed from ASSIGNED to POST

#16 Updated by daviddavis 6 months ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100

#17 Updated by kersom 4 months ago

  • Related to Test #5263: Test - As a user, I can upload chunks in parallel added
