Refactor #2950
Updated by dkliban@redhat.com about 7 years ago
The Django FileField[0] can be set using either a file path as a string or an instance of File[1]. When a string is passed in, the file is considered committed to disk. When saving an instance of a model that is using that FileField, the string that was used to set the file field is stored in the database with the rest of the model fields. When an instance of File is used to set the field, the file is considered not committed. When saving a model instance with a field set that way, a storage backend is used to save the file to it's destination defined by 'upload_to' parameter of the FileField. After the file is put in place, it's final path is stored in the database along with other model field values. The Artifact model uses a FileField to store the path to an Artifact stored in /var/lib/pulp/artifact. The 'upload_to' is a callable that returns a path in /var/lib/pulp/artifact based on the Artifact's sha256 field value. All this leaves the plugin writer with two options when creating an Artifact: move the file into /var/lib/pulp on her own or she can rely on the storage backend to move the file into the correct location. The latter provides a much simpler experience for plugin writers and a consistent behavior across plugins. The FileSystemStorage[2] backend provided by Django is designed to work with File, TemporaryUploadedFile[3], and InMemoryUploadedFile[4] objects. The latter two inherit from the first. The type of File determines how the storage backend places it in it's final destination. The Artifact upload API always uses a TemporaryUploadedFile, which writes the file to a temporary location on disk during the upload. Below is an outline of how FileSystemStorage saves a file to its destination for the two File types we are interested in: <pre> TemporaryUploadedFile ------------------------------ 1) is name available? 2a) yes, os.rename() 3a) no exception, you are done 3b) exception, copy from source to destination using python and delete the original 2b) no, append random characters to to the name and go back to 1 File ----- 1) is name available? 2a) yes, copy from source to destination using python 2b) no, append random characters to to the name and go back to 1 </pre> The save behavior for TemporaryUploadedFile is the most efficient for Pulp's use case. However, the behavior from 2b can result in duplicate files being created, but not associated with any Artifact. There is a race condition if the same file is being uploaded in two parallel requests. One of the requests will create the Artifact in /var/lib/pulp/artifact and then save it to the database. The second request will also create a file in /var/lib/pulp/artifacts, but it will have a random name. The database insert will fail due to uniqueness constraint and the user will receive a 400 error. I don't think any cleanup happens when this occurs. Due to how os.rename() works, there is also a race condition that would allow an existing file to be overwritten by a newer one. Since Pulp uses content addressable storage, this does not pose a problem. The FileSystemStorage backend has been in Django's codebase for 9 years. Extending it to stop the file renaming behavior would allow us to take advantage of all the testing that has been performed on it's file moving and copying code. The Pulp plugin API should provide a TemporaryDownloadedFile that plugin writers can use to create Artifacts. The TemporaryDownloadedFile needs to have a temporary_file_path() method that tells FileSystemStorage to use the code path for TemporaryUploadedFile. [0] https://github.com/django/django/blob/1.11/django/db/models/fields/files.py#L220 [1] https://github.com/django/django/blob/1.11/django/core/files/base.py#L14 [2] https://github.com/django/django/blob/1.11/django/core/files/storage.py#L249 [3] https://github.com/django/django/blob/1.11/django/core/files/uploadedfile.py#L59 [4] https://github.com/django/django/blob/1.11/django/core/files/uploadedfile.py#L84