RPMs partially downloaded
There have been multiple reports of RPMs ending up in /var/lib/pulp/content/ either at 0 bytes or only partially downloaded. Looking at the 6.1 code, it is difficult to identify how that is possible.
You can see here that in pulp 2.6, an rpm does not get saved into the DB until after validation has happened, and the file has been moved into place without errors. Katello has assured us that they supply "validate: true" with each sync request, so validation should be happening.
And yet, users are seeing this happen, so we need to investigate further.
#8 Updated by ttereshc over 5 years ago
An attempt to do an audit of the code path for downloading RPMs (pulp 2.6-release, pulp_rpm 2.6-release, nectar 1.3.3-1) results in the following.
Principal suspicions so far:
- non-atomic `shutil.move` calls could silently collide with each other
- pre-move content validation may not notice corrupted content
- possible race condition on concurrent downloads
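`shutil.move` may make putting the temp file into its final location non-atomic: when the working directory and the final location are on different filesystems, it copies the file and then removes the source instead of renaming it. That is unsafe unless proper locking is used.
Copying to the final location combined with concurrent downloads of the same RPM could be an issue.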
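To make the cross-filesystem case concrete, here is a minimal sketch of how one could check whether a move would stay on one filesystem. The helper name `same_filesystem` is hypothetical and is not part of pulp or nectar; it only compares the device IDs reported by `os.stat`.

```python
import os


def same_filesystem(src_path, dst_dir):
    """Return True if src_path and dst_dir are on the same device.

    Hypothetical helper for illustration only (not pulp code).
    When this returns False, shutil.move() cannot rename the file
    and falls back to copying the data, which is not atomic.
    """
    return os.stat(src_path).st_dev == os.stat(dst_dir).st_dev
```

If the working directory and /var/lib/pulp/content/ turn out to be on different devices in an affected deployment, the copy fallback would be the path being exercised.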
Assuming that `shutil.move` can't silently fail, there are a few potential scenarios in which an RPM file could be corrupted:
- Upstream repo A and repo B both contain the same RPM.
-- The RPM is synced from both repos at nearly the same time.
-- The download of the RPM from repo B started before the download from repo A had finished (the second download started because there was no reference to the RPM in the DB yet).
-- The download from repo A succeeded, validation succeeded, the file was moved to the final location without any corruption, and a reference to this RPM was saved into the DB.
-- The download from repo B succeeded, validation succeeded, but there was some problem during the copy to the final location.
-- Because the final location is the same in both cases, the file could end up corrupted.
- One repo, two nearly concurrent downloads of the same RPM from the same repo.
-- The first download succeeded and validation succeeded, but before the file was copied to the final location, the second download started writing into the same file in the working directory.
-- A possibly corrupted file is moved to the final location and saved into the DB.
- Both scenarios seem very unlikely to me, but in most of the reported cases the RPMs were partially downloaded during network problems, which makes these race conditions more probable.
- How could we end up with concurrent downloads of the same RPM from the same repo?
-- I do not think it is possible if there is only one feed specified for the repo. I followed the code path and all the locks look good to me.
-- So alternate content sources are under more suspicion: in that case several queues are used, one for each source. More investigation is needed.
Suggestion/workaround for both scenarios so far:
Validate the file at its final location: copy the file from the working directory to a temporary name in the final directory, then validate it and atomically rename it to the final name.
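The suggestion above could be sketched roughly as follows. This is only an illustration under assumptions, not the pulp implementation: the function names `file_sha256` and `install_validated` are hypothetical, and validation is assumed to mean a size check plus a SHA-256 checksum check against values taken from the repo metadata.

```python
import hashlib
import os
import shutil
import tempfile


def file_sha256(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_validated(src_path, final_path, expected_size, expected_sha256):
    """Copy src_path next to final_path, validate the copy, then rename it.

    Hypothetical helper for illustration only (not pulp code).
    """
    final_dir = os.path.dirname(os.path.abspath(final_path))
    # A unique temp name in the final directory: concurrent callers never
    # write to the same file, and the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=final_dir, prefix=".tmp-")
    os.close(fd)
    try:
        shutil.copyfile(src_path, tmp_path)
        # Validate the bytes that actually landed in the final directory.
        if os.path.getsize(tmp_path) != expected_size:
            raise ValueError("size mismatch for %s" % final_path)
        if file_sha256(tmp_path) != expected_sha256:
            raise ValueError("checksum mismatch for %s" % final_path)
        # On POSIX, rename within one filesystem is atomic: readers see
        # either the old file or the complete new one, never a partial one.
        os.rename(tmp_path, final_path)
    except Exception:
        # Never leave a partial or invalid temp file behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this shape, two concurrent installs of the same RPM each validate their own temp copy, and whichever rename happens last replaces the final file with a complete, validated one. Note that `tempfile.mkstemp` creates the file readable only by its owner, so real code would also need to set the intended permissions.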
Any thoughts, or issues in my logic?
#10 Updated by mhrivnak over 5 years ago
That all seems reasonable. I like the idea of validating it in the final location but with a different filename, and doing the atomic rename.
I added an association to issue #2142, which may also reveal a possible cause. That one seems to be quite reproducible with docker content. I could not reproduce that behavior with an rpm sync, but it's possible that something has changed in the yum importer since 2.6 to explain it. Maybe whatever problem is present in the docker importer used to also affect the yum importer.
#15 Updated by ttereshc about 5 years ago
We now do an additional validation while copying the file to its final location, so a sync will report errors when files are corrupted. However, since the root cause of the corrupted files is still unclear, this approach does not fix the underlying issue.