Refactor #2950
closedImprove Custom Storage Backend for 3.0
100%
Description
The Django FileField[0] can be set using either a file path as a string or an instance of File[1].
When a string is passed in, the file is considered committed to disk. When saving an instance of a model that is using that FileField, the string that was used to set the file field is stored in the database with the rest of the model fields.
When an instance of File is used to set the field, the file is considered not committed. When saving a model instance with a field set that way, a storage backend is used to save the file to it's destination defined by 'upload_to' parameter of the FileField. After the file is put in place, it's final path is stored in the database along with other model field values.
The Artifact model uses a FileField to store the path to an Artifact stored in /var/lib/pulp/artifact. The 'upload_to' is a callable that returns a path in /var/lib/pulp/artifact based on the Artifact's sha256 field value. All this leaves the plugin writer with two options when creating an Artifact: move the file into /var/lib/pulp on her own or she can rely on the storage backend to move the file into the correct location. The latter provides a much simpler experience for plugin writers and a consistent behavior across plugins.
The FileSystemStorage[2] backend provided by Django is designed to work with File, TemporaryUploadedFile[3], and InMemoryUploadedFile[4] objects. The latter two inherit from the first. The type of File determines how the storage backend places it in it's final destination. The Artifact upload API always uses a TemporaryUploadedFile, which writes the file to a temporary location on disk during the upload. Below is an outline of how FileSystemStorage saves a file to its destination for the two File types we are interested in:
TemporaryUploadedFile
------------------------------
1) is name available?
2a) yes, os.rename()
3a) no exception, you are done
3b) exception, copy from source to destination using python and delete the original
2b) no, append random characters to to the name and go back to 1
File
-----
1) is name available?
2a) yes, copy from source to destination using python
2b) no, append random characters to to the name and go back to 1
The save behavior for TemporaryUploadedFile is the most efficient for Pulp's use case. However, the behavior from 2b can result in duplicate files being created, but not associated with any Artifact. There is a race condition if the same file is being uploaded in two parallel requests. One of the requests will create the Artifact in /var/lib/pulp/artifact and then save it to the database. The second request will also create a file in /var/lib/pulp/artifacts, but it will have a random name. The database insert will fail due to uniqueness constraint and the user will receive a 400 error. I don't think any cleanup happens when this occurs.
Due to how os.rename() works, there is also a race condition that would allow an existing file to be overwritten by a newer one. Since Pulp uses content addressable storage, this does not pose a problem.
The FileSystemStorage backend has been in Django's codebase for 9 years. Extending it to stop the file renaming behavior would allow us to take advantage of all the testing that has been performed on it's file moving and copying code. The Pulp plugin API should provide a TemporaryDownloadedFile that plugin writers can use to create Artifacts. The TemporaryDownloadedFile needs to have a temporary_file_path() method that tells FileSystemStorage to use the code path for TemporaryUploadedFile.
[0] https://github.com/django/django/blob/1.11/django/db/models/fields/files.py#L220
[1] https://github.com/django/django/blob/1.11/django/core/files/base.py#L14
[2] https://github.com/django/django/blob/1.11/django/core/files/storage.py#L249
[3] https://github.com/django/django/blob/1.11/django/core/files/uploadedfile.py#L59
[4] https://github.com/django/django/blob/1.11/django/core/files/uploadedfile.py#L84
The FileSystemStorage backend provided by Django has been extended to conform to Pulp needs. When saving a file, the storage backend does nothing if a file by that name already exists. All TemporaryDownloadedFile and TemporaryUploadedFile objects are moved using os.rename() when possible.
The new ArtifactFileField allows plugin writers to instantiate an Artifact with a string path to a downloaded file. This helps the plugin writer avoid having to open and close a downloaded file. The pre_save()[0] method instantiates a TemporaryDownloadedFile object and passes it to the storage backend to save into place. The save()[1] method on the Artifact then closes the file opened by pre_save().
The get_artifact_path() and published_metadata_path() static methods are now helper methods that can be used with any storage backend. In particular, I expect these to work with S3 storage backend without any modification.
The signal handler for pre_delete of the Artifact model has also been removed. This handler relied on the custom storage backend to remove empty directories left behind after an Artifact has been removed. This is not necesary because there can only be 255 directories inside /var/lib/pulp/artifact. It is highly unlikely that a directory will remain empty for long.
[0] https://docs.djangoproject.com/en/1.11/ref/models/fields/#django.db.models.Field.pre_save [1] https://docs.djangoproject.com/en/1.11/ref/models/instances/#django.db.models.Model.save
closes: #2950 https://pulp.plan.io/issues/2950