Project

Profile

Help

Refactor #2950

closed

Improve Custom Storage Backend for 3.0

Added by dkliban@redhat.com over 6 years ago. Updated over 4 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

100%

Estimated time:
Platform Release:
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Sprint:
Sprint 27
Quarter:

Description

The Django FileField[0] can be set using either a file path as a string or an instance of File[1].

When a string is passed in, the file is considered committed to disk. When saving an instance of a model that is using that FileField, the string that was used to set the file field is stored in the database with the rest of the model fields.

When an instance of File is used to set the field, the file is considered not committed. When saving a model instance with a field set that way, a storage backend is used to save the file to it's destination defined by 'upload_to' parameter of the FileField. After the file is put in place, it's final path is stored in the database along with other model field values.

The Artifact model uses a FileField to store the path to an Artifact stored in /var/lib/pulp/artifact. The 'upload_to' is a callable that returns a path in /var/lib/pulp/artifact based on the Artifact's sha256 field value. All this leaves the plugin writer with two options when creating an Artifact: move the file into /var/lib/pulp on her own or she can rely on the storage backend to move the file into the correct location. The latter provides a much simpler experience for plugin writers and a consistent behavior across plugins.

The FileSystemStorage[2] backend provided by Django is designed to work with File, TemporaryUploadedFile[3], and InMemoryUploadedFile[4] objects. The latter two inherit from the first. The type of File determines how the storage backend places it in it's final destination. The Artifact upload API always uses a TemporaryUploadedFile, which writes the file to a temporary location on disk during the upload. Below is an outline of how FileSystemStorage saves a file to its destination for the two File types we are interested in:

TemporaryUploadedFile
------------------------------
    1) is name available?
         2a) yes, os.rename()
                 3a) no exception, you are done
                 3b) exception, copy from source to destination using python and delete the original
         2b) no, append random characters to to the name and go back to 1

File
-----
    1) is name available?
         2a) yes, copy from source to destination using python
         2b) no, append random characters to to the name and go back to 1

The save behavior for TemporaryUploadedFile is the most efficient for Pulp's use case. However, the behavior from 2b can result in duplicate files being created, but not associated with any Artifact. There is a race condition if the same file is being uploaded in two parallel requests. One of the requests will create the Artifact in /var/lib/pulp/artifact and then save it to the database. The second request will also create a file in /var/lib/pulp/artifacts, but it will have a random name. The database insert will fail due to uniqueness constraint and the user will receive a 400 error. I don't think any cleanup happens when this occurs.

Due to how os.rename() works, there is also a race condition that would allow an existing file to be overwritten by a newer one. Since Pulp uses content addressable storage, this does not pose a problem.

The FileSystemStorage backend has been in Django's codebase for 9 years. Extending it to stop the file renaming behavior would allow us to take advantage of all the testing that has been performed on it's file moving and copying code. The Pulp plugin API should provide a TemporaryDownloadedFile that plugin writers can use to create Artifacts. The TemporaryDownloadedFile needs to have a temporary_file_path() method that tells FileSystemStorage to use the code path for TemporaryUploadedFile.

[0] https://github.com/django/django/blob/1.11/django/db/models/fields/files.py#L220
[1] https://github.com/django/django/blob/1.11/django/core/files/base.py#L14
[2] https://github.com/django/django/blob/1.11/django/core/files/storage.py#L249
[3] https://github.com/django/django/blob/1.11/django/core/files/uploadedfile.py#L59
[4] https://github.com/django/django/blob/1.11/django/core/files/uploadedfile.py#L84

Actions #1

Updated by bmbouter over 6 years ago

Can links be posted to the code to be removed?

Actions #2

Updated by ttereshc over 6 years ago

  • Tracker changed from Issue to Refactor
  • % Done set to 0

Consider adding to the current sprint after grooming

Actions #3

Updated by jortel@redhat.com over 6 years ago

I ended up writing the FileStorage because getting the django builtin file storage engine to work was just too hard. It has behaviors that did not fit well with how we store artifacts. The file stuff in django is poorly documented (and in many cases poorly written) and I burned a lot of time digging though the code. That said, perhaps I missed something easy we can do to use the builtin storage. I am very much in favor of doing this task but I do not think it should be done before the alpha.

Actions #4

Updated by bmbouter over 6 years ago

  • Subject changed from remove custom storage backend from 3.0 to Remove Custom Storage Backend from 3.0
  • Description updated (diff)
  • Priority changed from Normal to High
  • Tags deleted (Pulp 3 Plugin Writer Alpha)

I added a link to the description. I agree not doing this for the alpha is OK. I've removed the alpha tag. I changed it to high priority because I think we need to swap this out soon.

Actions #5

Updated by bmbouter over 6 years ago

  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes

I think this is ready to be groomed. If others don't want it groomed or think it needs more discussion please ungroom.

Actions #6

Updated by mhrivnak over 6 years ago

Some users will want to look at the file on disk for various reasons, particularly if they suspect something has gone wrong. We see that a lot with Pulp 2. So I think it's valuable to return the path on disk through the API. If we don't, those users will go digging through the database to find it, and we'd rather not give them reason to go in there.

Actions #7

Updated by mhrivnak over 6 years ago

  • Sprint/Milestone set to 45
Actions #8

Updated by jortel@redhat.com over 6 years ago

The FileStorage class conforms to the django Storage interface, is tested and works. I'm in favor of revisiting this at some point but given the number of features that still need to be implemented and bugs that need to get fixed in pulp3, we should not be spending time on this now. I should have objected during sprint 26 planning.

Actions #9

Updated by dkliban@redhat.com over 6 years ago

The only part of the custom storage backend that should remain is the implementation for the _save() method. This is the functionality that will allow us to put bits on disk once and then use the move operation to only logically move the file in it's final destination. The rest of the implementation should use the default behavior from Django.

Actions #10

Updated by dkliban@redhat.com over 6 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com
Actions #11

Updated by dkliban@redhat.com over 6 years ago

  • Status changed from ASSIGNED to POST
Actions #12

Updated by dkliban@redhat.com over 6 years ago

  • Subject changed from Remove Custom Storage Backend from 3.0 to Improve Custom Storage Backend for 3.0
  • Description updated (diff)

I updated the issue description with everything I learned while working on this ticket.

Actions #13

Updated by dkliban@redhat.com over 6 years ago

  • Description updated (diff)
Actions #14

Updated by jortel@redhat.com over 6 years ago

  • Sprint/Milestone changed from 45 to 46

Added by dkliban@redhat.com over 6 years ago

Revision 15857fb0 | View on GitHub

The FileSystemStorage backend provided by Django has been extended to conform to Pulp needs. When saving a file, the storage backend does nothing if a file by that name already exists. All TemporaryDownloadedFile and TemporaryUploadedFile objects are moved using os.rename() when possible.

The new ArtifactFileField allows plugin writers to instantiate an Artifact with a string path to a downloaded file. This helps the plugin writer avoid having to open and close a downloaded file. The pre_save()[0] method instantiates a TemporaryDownloadedFile object and passes it to the storage backend to save into place. The save()[1] method on the Artifact then closes the file opened by pre_save().

The get_artifact_path() and published_metadata_path() static methods are now helper methods that can be used with any storage backend. In particular, I expect these to work with S3 storage backend without any modification.

The signal handler for pre_delete of the Artifact model has also been removed. This handler relied on the custom storage backend to remove empty directories left behind after an Artifact has been removed. This is not necesary because there can only be 255 directories inside /var/lib/pulp/artifact. It is highly unlikely that a directory will remain empty for long.

[0] https://docs.djangoproject.com/en/1.11/ref/models/fields/#django.db.models.Field.pre_save [1] https://docs.djangoproject.com/en/1.11/ref/models/instances/#django.db.models.Model.save

closes: #2950 https://pulp.plan.io/issues/2950

Added by dkliban@redhat.com over 6 years ago

Revision 15857fb0 | View on GitHub

The FileSystemStorage backend provided by Django has been extended to conform to Pulp needs. When saving a file, the storage backend does nothing if a file by that name already exists. All TemporaryDownloadedFile and TemporaryUploadedFile objects are moved using os.rename() when possible.

The new ArtifactFileField allows plugin writers to instantiate an Artifact with a string path to a downloaded file. This helps the plugin writer avoid having to open and close a downloaded file. The pre_save()[0] method instantiates a TemporaryDownloadedFile object and passes it to the storage backend to save into place. The save()[1] method on the Artifact then closes the file opened by pre_save().

The get_artifact_path() and published_metadata_path() static methods are now helper methods that can be used with any storage backend. In particular, I expect these to work with S3 storage backend without any modification.

The signal handler for pre_delete of the Artifact model has also been removed. This handler relied on the custom storage backend to remove empty directories left behind after an Artifact has been removed. This is not necesary because there can only be 255 directories inside /var/lib/pulp/artifact. It is highly unlikely that a directory will remain empty for long.

[0] https://docs.djangoproject.com/en/1.11/ref/models/fields/#django.db.models.Field.pre_save [1] https://docs.djangoproject.com/en/1.11/ref/models/instances/#django.db.models.Model.save

closes: #2950 https://pulp.plan.io/issues/2950

Actions #15

Updated by dkliban@redhat.com over 6 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #16

Updated by bmbouter about 6 years ago

  • Sprint set to Sprint 27
Actions #17

Updated by bmbouter about 6 years ago

  • Sprint/Milestone deleted (46)
Actions #18

Updated by daviddavis almost 5 years ago

  • Sprint/Milestone set to 3.0.0
Actions #19

Updated by bmbouter almost 5 years ago

  • Tags deleted (Pulp 3)
Actions #20

Updated by bmbouter over 4 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Also available in: Atom PDF