Issue #3915
Status: closed
The ArtifactFileField is deleting source files.
Description
Problem
The ArtifactFileField uses TemporaryDownloadedFile, which is based on TemporaryUploadedFile and has the unexpected behavior of deleting the source file. When used with a file downloaded to the task's temporary directory, the delete is undesirable because it's not safe to assume the plugin is finished with the file.
cat = Artifact(file='/var/cache/pulp/abcd', sha256='TIGER')
cat.save()
As a result, /var/lib/pulp/artifact/TI/GER is created and (tiger) Artifact.file=/var/lib/pulp/artifact/TI/GER is inserted in the DB but /var/cache/pulp/abcd got deleted and I'm not done with it yet.
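For readers unfamiliar with the mechanism, here is a minimal, standalone sketch (not Pulp code; PretendDownloadedFile and all paths are made up) of why the source disappears: FileSystemStorage moves, rather than copies, any file object that exposes temporary_file_path(), which is exactly what TemporaryUploadedFile, and therefore TemporaryDownloadedFile, provides.

import os
import tempfile

from django.conf import settings
from django.core.files import File
from django.core.files.storage import FileSystemStorage

settings.configure()  # minimal standalone settings, for illustration only


class PretendDownloadedFile(File):
    """Mimics a file downloaded to a temp dir; like TemporaryUploadedFile,
    it exposes temporary_file_path(), which triggers the move-not-copy path."""

    def temporary_file_path(self):
        return self.file.name


# Create a "downloaded" source file.
source = tempfile.NamedTemporaryFile(delete=False, suffix='-abcd')
source.write(b'downloaded content')
source.close()

storage = FileSystemStorage(location=tempfile.mkdtemp())
with open(source.name, 'rb') as fp:
    storage.save('TI/GER', PretendDownloadedFile(fp))

print(os.path.exists(source.name))  # False: the source was moved into storage, not copied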
Bigger Problem
This is especially bad in cases where Artifact.file is set to a path within MEDIA_ROOT/artifact, as is recommended for bulk create.
Consider a simple mistake made by a plugin writer.
cat = Artifact(file='/var/lib/pulp/artifact/CA/T', sha256='CAT')
cat.save()
As a result, /var/lib/pulp/artifact/CA/T is created and (cat) Artifact.file=/var/lib/pulp/artifact/CA/T is inserted in the DB.
Then, oops, the dog artifact is created with file=<same as cat>:
dog = Artifact(file='/var/lib/pulp/artifact/CA/T', sha256='DOG')
dog.save()  # This could be Artifact.objects.bulk_create([dog]) with the same result.
The result is:
1. /var/lib/pulp/artifact/DO/G is created with content of: /var/lib/pulp/artifact/CA/T
2. /var/lib/pulp/artifact/CA/T is deleted and now the cat artifact is broken.
3. The (dog) artifact is inserted in the DB with file=/var/lib/pulp/artifact/DO/G
Recommended:
The ArtifactFileField should not be deleting files.
The Artifact(file=) should not permit file= to be a path inside the MEDIA_ROOT.
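To make the second recommendation concrete, here is a hedged sketch of the kind of guard that could enforce it; validate_source_path is a hypothetical helper name, not existing Pulp code:

import os

from django.conf import settings


def validate_source_path(path):
    """Refuse file= sources that already live inside MEDIA_ROOT (artifact storage)."""
    media_root = os.path.realpath(settings.MEDIA_ROOT)
    source = os.path.realpath(path)
    if os.path.commonpath([media_root, source]) == media_root:
        raise ValueError(
            "{} is already inside MEDIA_ROOT; refusing to move an existing "
            "artifact file".format(path))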
Updated by bmbouter over 6 years ago
I agree it's possible for a plugin writer to do something unsafe if they are doing it wrong. That's a usability problem, but I'm not sure how we can improve it (see the last bit).
The main problem statement suggests that it's unexpected that the file will be moved into place. I don't understand that problem statement. Django works that way, and Pulp documents it too here: https://github.com/pulp/pulp/blob/master/pulpcore/pulpcore/app/models/content.py#L15-L16
In practice, how would this simple mistake happen? Plugin writers don't type in these paths; they get them using Artifact.storage_path() and use the path unmodified.
I could see adding some validation as one way to improve this. The problem, though, is that the validation would only be called when calling save() itself; it would never be called when plugin writers are moving files into place themselves, as with Artifact.bulk_create().
So even if we add validation, plugin writers can still mess things up. There isn't any way we can stop them because they have access to the filesystem directly.
@jortel so with all ^ comments what do you think we should do?
Updated by jortel@redhat.com over 6 years ago
bmbouter wrote:
I agree it's possible for a plugin writer to do something unsafe if they are doing it wrong. That's a usability problem, but I'm not sure how we can improve it (see the last bit).
The main problem statement suggests that it's unexpected that the file will be moved into place. I don't understand that problem statement. Django works that way, and Pulp documents it too here: https://github.com/pulp/pulp/blob/master/pulpcore/pulpcore/app/models/content.py#L15-L16
I don't agree that "Django works that way". The delete only happens because our ArtifactFileField is (indirectly) implemented using TemporaryUploadedFile, which deletes the source by default. The stock Django FileField does not behave this way, except for [1]. So, I think most Django developers will find the delete surprising.
As for it being documented: agreed, I missed that. Although the term "move" literally means that the file will no longer be located at the source, it would be easy to read that figuratively, and the delete could be unexpected. That said, clearly saying that the source will be DELETED would be an improvement.
[1] FileSystemStorage does move the file instead of copying it when the source is on the same filesystem, which means its behavior isn't consistent, or consistent with the Storage API.
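For context, here is a paraphrased gist (not verbatim Django source; save_gist is a made-up name) of the storage behavior described above and in [1]:

from django.core.files.move import file_move_safe


def save_gist(storage, name, content):
    full_path = storage.path(name)
    if hasattr(content, 'temporary_file_path'):
        # file_move_safe() renames when source and destination are on the same
        # filesystem and falls back to copy-then-delete otherwise; either way
        # the source file is gone afterwards.
        file_move_safe(content.temporary_file_path(), full_path)
    else:
        # Ordinary file objects are streamed into a new file; the source is untouched.
        with open(full_path, 'wb') as destination:
            for chunk in content.chunks():
                destination.write(chunk)
    return name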
In practice, how would this simple mistake happen? Plugin writers don't type in these paths; they get them using Artifact.storage_path() and use the path unmodified.
Per the example in the description, the plugin writer could mean to be doing the right thing but accidentally do something else. Our artifact storage API should guard against foreseeable pitfalls as much as is possible/practical.
I could see adding some validation as one way to improve this. The problem, though, is that the validation would only be called when calling save() itself; it would never be called when plugin writers are moving files into place themselves, as with Artifact.bulk_create().
So even if we add validation, plugin writers can still mess things up. There isn't any way we can stop them because they have access to the filesystem directly.
True, but I think plugin code that touches the filesystem directly would be a clear indicator that the writer did not have good intentions.
@jortel so with all ^ comments what do you think we should do?
Adding clearer, more explicit language in the Artifact docstrings that the source file will be deleted would help.
I'm mainly concerned about setting file= to something already in storage (within MEDIA_ROOT) when creating an Artifact, given the (move) delete behavior. I don't think we need to permit file= being a path to anything inside the MEDIA_ROOT to support bulk create, and we should prevent it.
This reproducer bulk creates an artifact using file=/tmp/jeff and the file is moved into storage.
def reproducer():
    artifact = Artifact(
        file='/tmp/jeff',
        size=1,
        md5='a',
        sha1='a',
        sha224='a',
        sha256='apple',
        sha384='a',
        sha512='a')
    Artifact.objects.bulk_create([artifact])
The file /var/lib/pulp/artifact/ap/ple is created with the content of /tmp/jeff:
[jortel@f27a artifact]$ tree .
.
├── 03
│   └── f03d345a6e562ad6618cfab11615dec474f1f593265fceff7694ce5f8024d8
├── a7
│   └── ad7899772cd1148c1b103cf661899890cd4e325582f947b05e1d36ecc94ae1
└── ap
    └── ple <------- HERE
pulp=# select id,file from pulp_app_artifact;
id | file
----+------------------------------------------------------------------------------------------
19 | /var/lib/pulp/artifact/03/f03d345a6e562ad6618cfab11615dec474f1f593265fceff7694ce5f8024d8
22 | /var/lib/pulp/artifact/49/bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14
23 | /var/lib/pulp/artifact/ap/ple <----- HERE
(5 rows)
Updated by dkliban@redhat.com about 6 years ago
@jortel, I was able to run your reproducer with the bulk_save(). That means the pre_save handler is getting called. We could definitely add the check on the source location in there.
Updated by jortel@redhat.com about 6 years ago
dkliban@redhat.com wrote:
@jortel, I was able to run your reproducer with the bulk_save(). That means the pre_save handler is getting called. We could definitely add the check on the source location in there.
+1 for a validation that prevents the source path from being inside the MEDIA_ROOT.
The reproducer also demonstrates that we don't need (or want) plugin writers (or any of the core stages) to be doing what is recommended here to support bulk save. Also, the proposed validation would not allow it.
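As a sketch of where the proposed validation could live (hypothetical names, not the actual patch): a field-level pre_save() runs for both save() and bulk_create(), so a check there would cover both code paths.

import os

from django.conf import settings
from django.db import models


class GuardedFileField(models.FileField):
    """Illustrative FileField that refuses source files already inside MEDIA_ROOT."""

    def pre_save(self, model_instance, add):
        file = getattr(model_instance, self.attname)
        source = getattr(file, 'name', None)
        if source and os.path.isabs(source):
            media_root = os.path.realpath(settings.MEDIA_ROOT)
            if os.path.commonpath([media_root, os.path.realpath(source)]) == media_root:
                raise ValueError(
                    "The source file is already in artifact storage; refusing to move it.")
        return super().pre_save(model_instance, add)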
Updated by dkliban@redhat.com about 6 years ago
@jortel, I agree that we should remove the documentation around this problem. We should also remove the code that does the manual moving of files in the ArtifactSaver stage.
Updated by bmbouter about 6 years ago
@jortel thank you for re-raising this. I was baffled to see that our pre_save handler was being called and working correctly, since the Django bulk_create() docs told me it wouldn't be. After looking around in the code I wanted to leave a bit of info on what I found.
Our pre_save is actually a field-level handler, which is used by both normal save() and bulk_create(). The pre_save and post_save signals described in the save() docs are not the same thing, and indeed are not sent by bulk_create(). These all being named the same thing is confusing.
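To illustrate the naming collision, a standalone sketch (the Example model and 'demo' app label are made up):

import django
from django.conf import settings

settings.configure()  # minimal standalone setup, for illustration only
django.setup()

from django.db import models
from django.db.models.signals import pre_save
from django.dispatch import receiver


class Example(models.Model):
    # FileField.pre_save() is a *field* method: both save() and bulk_create()
    # invoke it while preparing each field's value for the INSERT, which is
    # why the file still gets moved into storage during bulk_create().
    file = models.FileField(upload_to='demo/')

    class Meta:
        app_label = 'demo'  # made-up label so the model can be declared standalone


@receiver(pre_save, sender=Example)
def on_pre_save(sender, instance, **kwargs):
    # This pre_save is the *signal*: it is sent by save() but NOT by
    # bulk_create(), exactly as the Django docs describe.
    pass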
What's great is that this change is definitely an improvement! Thank you.
Updated by CodeHeeler about 6 years ago
- Priority changed from High to Normal
- Triaged changed from No to Yes
- Sprint set to Sprint 43
- Tags Pulp 3 added
Updated by dkliban@redhat.com about 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
Updated by dkliban@redhat.com about 6 years ago
- Status changed from ASSIGNED to POST
Added by dkliban@redhat.com about 6 years ago
Revision c41450e8 | View on GitHub
Problem: ArtifactFileField can delete an existing artifact's file
Solution: Add a check to see if the source file is already in artifact storage
Prior to this patch it was possible to break an existing artifact by passing a path to its file as the file for a new Artifact.
This patch also removes extraneous code in the ArtifactSaver stage. The code was present because an assumption was made that bulk_save() on Artifacts would not call the pre_save handler on the ArtifactFileField. However, this is not the case and the pre_save handler is getting called.
closes: #3915 https://pulp.plan.io/issues/3915
Updated by dkliban@redhat.com about 6 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|c41450e89d8181fc905644239aa33b73b69ca3fc.
Updated by bmbouter almost 5 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE