Story #5613

Provide ability to report and redownload (if possible) corrupted content on the file system

Added by jsherril@redhat.com 4 months ago. Updated about 1 month ago.

Status: NEW
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
% Done: 0%
Platform Release:
Blocks Release:
Backwards Incompatible: No
Groomed: No
Sprint Candidate: No
Tags: Katello-P2
QA Contact:
Complexity:
Smash Test:
Verified: No
Verification Required: No
Sprint:

Description

Users can experience corrupted files due either to bit rot on mechanical hard drives or to mistakenly run commands. We should make it as easy as possible to correct this situation when we can. I imagine a task that, for a given repo:

1. examines all downloaded content in the latest repository version and checks whether each unit matches its expected checksum
2. re-downloads any corrupted files if possible
3. reports, as counts, any units that could not be re-downloaded
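The three steps above can be sketched roughly as follows. This is a minimal illustration, not Pulp's implementation: `Unit`, `repair`, and the `redownload` callback are hypothetical names, and the sketch assumes each unit records the expected SHA-256 of its file on disk.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


@dataclass
class Unit:
    """Hypothetical stand-in for one downloaded content unit."""
    path: Path    # where the unit's file lives on disk
    sha256: str   # checksum recorded when the unit was first downloaded


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> Optional[str]:
    """Return the file's SHA-256 hex digest, or None if the file is missing."""
    digest = hashlib.sha256()
    try:
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    except FileNotFoundError:
        return None
    return digest.hexdigest()


def repair(units: Iterable[Unit], redownload: Callable[[Unit], bool]) -> Dict[str, int]:
    """Verify every unit and try to re-download the ones that fail.

    `redownload` is a hypothetical callback that fetches the unit from its
    remote again and returns False if that is not possible.
    """
    corrupted = fixed = 0
    for unit in units:
        # Step 1: compare the file on disk with the expected checksum.
        if sha256_of(unit.path) == unit.sha256:
            continue
        corrupted += 1
        # Step 2: re-download, then re-verify before counting it as fixed.
        if redownload(unit) and sha256_of(unit.path) == unit.sha256:
            fixed += 1
    # Step 3: report counts only, including what could not be repaired.
    return {"corrupted": corrupted, "fixed": fixed, "unfixable": corrupted - fixed}
```

Re-verifying after the download keeps a flaky remote from being counted as a successful fix.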

Note: this only fixes bit-rotted/corrupted files. It only inspects the content contained in the latest repository version, and does not consider whether that repository version is an accurate or complete representation of the remote filesystem.

History

#1 Updated by bmbouter 3 months ago

  • Sprint/Milestone set to 3.1.0

#2 Updated by daviddavis about 2 months ago

  • Sprint/Milestone deleted (3.1.0)

#3 Updated by bmbouter about 1 month ago

Question 1: which endpoint?

I was wondering whether this should be a repository action endpoint like sync, e.g. /fix/ or /repair/. I decided it should not be because:

a) we may want to offer this functionality system-wide later
b) plugin writers would have to enable that endpoint. We could enable it for all plugins without their involvement, but that weakens the practice that repository detail endpoints are plugin-owned.

So if we did it system-wide, which endpoint would it be?

/pulp/api/v3/repair/
/pulp/api/v3/filesystem-repair/
/pulp/api/v3/repo-repair/
/pulp/api/v3/fix/
/pulp/api/v3/filesystem-fix/
/pulp/api/v3/repo-fix/

Question 2: What Pulp locks need to guard this task?

I think we need to lock on the repository so that other repository operations won't make new repo versions underneath the content while it's being examined. System-wide checking (not proposed in this ticket) would be a whole different story.
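A minimal sketch of the locking idea, assuming an in-process lock registry. `RepositoryLocks` and `hold` are hypothetical names and do not describe Pulp's tasking system, which would have to lock across worker processes rather than threads.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class RepositoryLocks:
    """Hypothetical per-repository lock registry (not Pulp's tasking API).

    Any task that creates repository versions (sync, upload, repair) takes the
    repository's lock first, so the content being examined cannot change mid-run.
    """

    def __init__(self) -> None:
        self._guard = threading.Lock()               # protects the registry itself
        self._locks = defaultdict(threading.Lock)    # one lock per repository id

    @contextmanager
    def hold(self, repo_id: str, timeout: float = -1):
        """Hold the repository's lock; timeout=-1 waits indefinitely."""
        with self._guard:
            lock = self._locks[repo_id]
        if not lock.acquire(timeout=timeout):
            raise TimeoutError(f"repository {repo_id} is busy")
        try:
            yield
        finally:
            lock.release()
```

Locking per repository, rather than globally, lets repairs of different repositories run concurrently; a system-wide check would need a different scheme.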

Question 3: To confirm, this only fixes "downloaded content" right?

Say we implement a "continue even if 404s occur" option, and two RPMs end up missing from the latest repo version. This utility won't fix that. If it's expected to fix that as well, then we need to integrate this deeply with sync somehow. What is the best approach for this case?

#4 Updated by jsherril@redhat.com about 1 month ago

Couple of comments:

Question 2: This makes sense to me

Question 3:

> Say we implement a "continue even if 404s occur" option, and two RPMs end up missing from the latest repo version. This utility won't fix that. If it's expected to fix that as well, then we need to integrate this deeply with sync somehow. What is the best approach for this case?

I think it's fine for this repair task not to fix that, since the content was never downloaded in the first place. But what would fix that? A further sync (assuming the underlying upstream repository is fixed too)?

#5 Updated by bmbouter about 1 month ago

jsherril@redhat.com wrote:

> I think it's fine for this repair task not to fix that, since the content was never downloaded in the first place. But what would fix that? A further sync (assuming the underlying upstream repository is fixed too)?

Yes, a further sync would resolve this if the upstream repository was also fixed. During that sync, Pulp will recognize that it does not have the content unit.
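Conceptually, a resync finds the missing units by comparing what the upstream metadata lists with what the repository version already contains. The helper below is a hypothetical simplification that uses plain string keys for units; a real sync compares much richer metadata.

```python
from typing import Iterable, List


def units_to_download(upstream: Iterable[str], present: Iterable[str]) -> List[str]:
    """Return unit keys listed upstream that the repository version lacks.

    Keys are hypothetical natural keys (e.g. an RPM's NEVRA string).
    """
    return sorted(set(upstream) - set(present))
```

For example, if upstream now lists `a`, `b`, and `c` but only `a` was synced earlier, the resync would fetch `b` and `c`.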

#6 Updated by ttereshc about 1 month ago

Question 3:

> Say we implement a "continue even if 404s occur" option, and two RPMs end up missing from the latest repo version. This utility won't fix that. If it's expected to fix that as well, then we need to integrate this deeply with sync somehow. What is the best approach for this case?

> I think it's fine for this repair task not to fix that, since the content was never downloaded in the first place. But what would fix that? A further sync (assuming the underlying upstream repository is fixed too)?

For the upcoming sync to fix it, a sync should always be operational, with no condition to skip it.
How confident are we that always running a full sync is good enough performance-wise? If nothing changed upstream, we'll still analyse all the repodata, check whether every content unit exists in Pulp, and then possibly perform all the checks to ensure that the repo is valid (I'm not sure whether we perform those checks when nothing changed in a repo). Maybe it's fast and there's no need to worry.

If we want any optimization, we'll either need to introduce force_full to Pulp 3, or we may need to do a full repair here and not only repair downloaded content.
What do you think?

Also we potentially can have different options for the suggested endpoints:
/pulp/api/v3/repair/ - full repair = corrupted files + download missing ones
/pulp/api/v3/filesystem-repair/ - only corrupted files
/pulp/api/v3/repo-repair/ - corrupted and downloaded for a specific repo

I'm worried that having all those options adds complexity but at the same time it gives flexibility.

#7 Updated by bmbouter about 1 month ago

ttereshc wrote:

> For the upcoming sync to fix it, a sync should always be operational, with no condition to skip it.

Agreed

> How confident are we that always running a full sync is good enough performance-wise? If nothing changed upstream, we'll still analyse all the repodata, check whether every content unit exists in Pulp, and then possibly perform all the checks to ensure that the repo is valid (I'm not sure whether we perform those checks when nothing changed in a repo). Maybe it's fast and there's no need to worry.

I suspect that no matter how fast it is, users will want it to be faster when it can be. I'd like us to first measure the RPM resync time when nothing has changed, so we can understand how much opportunity for speedup there is.

> If we want any optimization, we'll either need to introduce force_full to Pulp 3, or we may need to do a full repair here and not only repair downloaded content.
> What do you think?

What I learned from Pulp 2 is that any time we introduce an optimization, we need a way to turn it (or all of them) off.
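One way to honor that rule is to make every skip path conditional on an explicit switch. The sketch below is hypothetical: `should_skip_sync` and the metadata-digest comparison are assumptions for illustration, and `force_full` only borrows the option name mentioned above.

```python
import hashlib
from typing import Optional


def should_skip_sync(upstream_metadata: bytes,
                     last_synced_digest: Optional[str],
                     force_full: bool = False) -> bool:
    """Decide whether a sync may skip its full pass (hypothetical helper).

    The optimization: if the upstream metadata is byte-for-byte what we saw
    last time, the expensive work is skipped. `force_full` is the off switch,
    so a sync can always be made fully operational.
    """
    if force_full or last_synced_digest is None:
        return False  # optimization disabled, or no previous sync to compare to
    return hashlib.sha256(upstream_metadata).hexdigest() == last_synced_digest
```

Keeping the override in the same function as the shortcut makes it hard to add a new skip path that the switch does not cover.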

> Also we potentially can have different options for the suggested endpoints:
> /pulp/api/v3/repair/ - full repair = corrupted files + download missing ones
> /pulp/api/v3/filesystem-repair/ - only corrupted files
> /pulp/api/v3/repo-repair/ - corrupted and downloaded for a specific repo

> I'm worried that having all those options adds complexity but at the same time it gives flexibility.

This is my primary concern also.

#8 Updated by bmbouter about 1 month ago

  • Description updated

I added a clarification to the description that the task only inspects the filesystem and won't check the remote metadata to determine whether the repository version is "complete".

Question: how will the "unable to be fixed" units be reported?

In terms of current task reporting capabilities, progress reports can only report counts. Is that acceptable? It could look like:

{
    "progress_reports": [
        {
            "code": "fixed_count",
            "done": 5,
            "message": "The count of content items that were fixed.",
            "state": "completed",
            "suffix": null,
            "total": 5
        },
        {
            "code": "pre_fix_corrupted_count",
            "done": 10,
            "message": "The count of content items that were corrupted prior to fixing",
            "state": "completed",
            "suffix": null,
            "total": 10
        }
    ]
}
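For illustration, a hypothetical helper that emits count-only reports with the same fields as the JSON above. The function name and signature are assumptions; the number of units that could not be fixed is not reported directly but follows as `pre_fix_corrupted - fixed`.

```python
import json


def repair_progress_reports(pre_fix_corrupted: int, fixed: int) -> dict:
    """Build count-only progress reports for a finished repair (hypothetical)."""
    def report(code: str, message: str, count: int) -> dict:
        # A completed report has done == total.
        return {"code": code, "done": count, "message": message,
                "state": "completed", "suffix": None, "total": count}

    return {"progress_reports": [
        report("fixed_count", "The count of content items that were fixed.", fixed),
        report("pre_fix_corrupted_count",
               "The count of content items that were corrupted prior to fixing",
               pre_fix_corrupted),
    ]}


if __name__ == "__main__":
    # json.dumps renders Python None as null, matching the sample above.
    print(json.dumps(repair_progress_reports(10, 5), indent=4))
```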

#9 Updated by jsherril@redhat.com about 1 month ago

That looks good to me!
