Issue #1823
closedRPMs partially downloaded
Description
There have been multiple reports of RPMs ending up in /var/lib/pulp/content/ with either 0 bytes, or partially downloaded. Looking at the 6.1 code, it is difficult to identify how that is possible.
You can see here that in pulp 2.6, an rpm does not get saved into the DB until after validation has happened, and the file has been moved into place without errors. Katello has assured us that they supply "validate: true" with each sync request, so validation should be happening.
And yet, users are seeing this happen, so we need to investigate further.
Related issues
Updated by Anonymous almost 7 years ago
- Groomed changed from No to Yes
- Sprint Candidate changed from No to Yes
Updated by Anonymous almost 7 years ago
- Sprint/Milestone set to 22
- Groomed changed from Yes to No
- Sprint Candidate changed from Yes to No
Updated by ttereshc almost 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to ttereshc
Updated by ttereshc almost 7 years ago
An attempt to do an audit of the code path for downloading RPMs (pulp 2.6-release, pulp_rpm 2.6-release, nectar 1.3.3-1) results in the following.
Principal suspicions so far:
- non-atomic `shutil.move` could silently collide with each other
- pre-move content validation may not notice corrupted content
- possible race condition on concurrent downloads
`shutil.move` may make the process of putting temp file into its final location non-atomic (between different FS) and thus unsafe unless proper locking is used.
Copy to the final location + concurrent downloads of the same RPM could be an issue.
Assuming that `shutil.move` can't silently fail there are few potential scenarios in which RPM file could be corrupted:
Scenario 1
- Upstream repo A and repo B which both contain the same RPM.
- RPM from both repos is synced nearly at the same time.
- Download of RPM from repo B started before download from repo A was finished (second download started because there is no reference for RPM in DB yet).
- Download from repo A succeeded, validation succeeded, file was moved to the final location and is not corrupted anyhow, reference to this RPM is saved into the DB.
- Download from repo B succeeded, validation succeeded, there was some problem during copy to the final location.
- Because the final location will be the same in both cases, the file could be corrupted at the end.
Scenario 2
- One repo, two nearly concurrent downloads of the same RPM from the same repo
- First download succeeded, validation succeeded but before it was copied to the final location second download started to write into the same file in the working directory.
- Possibly corrupted file is moved to the final location and saved into the DB.
Uncertainties:
- Both scenarios seem to me to be very unlikely but in most cases RPM were partially downloaded in case of some network problems, so the probability of the race conditions is getting higher.
- How to end up with concurrent downloads of the same RPM from the same repo?
-- I do not think it is possible if there is only one feed specified for the repo. I followed the code path and all the locks look good to me.
-- So alternate content sources are under more suspicion, several queues, one for each source, are used in this case. More investigation needed.
Suggestion/workaround for both scenarios so far:
Validate file at its final location (copy file from working directory into the temp name into the final directory, then validate and atomically rename it to the final name)
Any thoughts? or issues in my logic?
Updated by mhrivnak almost 7 years ago
- Related to Issue #2142: Units created with 0-byte files when sync runs out of disk space added
Updated by mhrivnak almost 7 years ago
That all seems reasonable. I like the idea of validating it in the final location but with a different filename, and doing the atomic rename.
I added an association to issue #2142, which may also reveal a possible cause. That one seems to be quite reproducible with docker content. I could not reproduce that behavior with an rpm sync, but it's possible that something has changed in the yum importer since 2.6 to explain it. Maybe whatever problem is present in the docker importer used to also affect the yum importer.
Updated by ttereshc almost 7 years ago
- Blocked by Issue #2190: Unit is associated with the repo before it is copied to the final location added
Updated by ttereshc over 6 years ago
- Status changed from ASSIGNED to POST
Added by ttereshc over 6 years ago
Added by ttereshc over 6 years ago
Verify size of a file at its final location
Added by ttereshc over 6 years ago
Verify size of a file at its final location
Updated by ttereshc over 6 years ago
- Status changed from POST to MODIFIED
Applied in changeset 96fd275c2b8b3e6f1ce866a21af2ce47736c8f39.
Updated by ttereshc over 6 years ago
Now we are doing an additional validation during copy to the final location of the file, so there will be errors during sync when files are corrupted. However, since root cause of the corrupted files is still unclear, this approach does not fix the root issue.
Updated by semyers over 6 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
Verify RPM/SRPM/DRPM unit at its final location
closes #1823 https://pulp.plan.io/issues/1823