Story #2619

Need a file-system integrity report for /var/lib/pulp

Added by about 1 year ago. Updated 3 days ago.


In certain cases, due to historical issues with Satellite or ongoing problems during content manipulation, the files under /var/lib/pulp can become inconsistent.

We have customers who are about to start using the new 'repair' facility provided by this feature we are adding:

  • [RFE] Allow Pulp to verify/repair corrupted packages in a repository

With the addition of the repair side of this feature, we need a way to identify the following conditions:

  • Missing RPMs from /var/lib/pulp/content
  • Corrupt (NOT OK) md5sums on any unit in /var/lib/pulp/content
  • Invalid repositories within /var/lib/pulp/published where the yum metadata points at symlinks that are missing
  • Missing or broken symlinks for published repositories for Content Views


Source: /var/lib/pulp/published/yum/master/yum_distributor/Default_Organization-Library-rhel7-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_RPMs_x86_64_7Server/1472243944.1

Target: /var/lib/pulp/published/yum/https/repos/Default_Organization/Library/rhel7/content/dist/rhel/server/7/7Server/x86_64/os
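
A check for the last condition (missing or broken symlinks) could be sketched roughly as follows. This is a minimal illustration, not the tool's implementation; the function name find_broken_symlinks is hypothetical, and it only reports links whose target does not exist, without qualifying them against stale publishes.

```python
import os


def find_broken_symlinks(root):
    """Return a sorted list of symlinks under root whose target is missing."""
    broken = []
    # os.walk does not follow symlinked directories by default, so links to
    # directories show up in dirnames and all other links in filenames.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

It would be called as find_broken_symlinks("/var/lib/pulp/published") on a real system.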

More criteria may be added later, but to restore confidence in the integrity of /var/lib/pulp we need to be able to report on the state of this directory.

Generating this report is expected to take a very long time, but that should not be a blocker for the implementation.


#1 Updated by 5 months ago

I imagine this would be a standalone tool (script) that runs on each Satellite/Capsule and writes a report. The tool should display progress when possible and write a file containing the report.


$ tool -h

  -s  validate that stored content files exist and match size/checksum when known.
  -b  validate symlinks (find broken)
  -m  validate that published metadata references valid symlinks
  -a  validate all.

  -p  restrict publishing validation to a specific directory.  default: /var/lib/pulp/published
  -o  path to generated report.
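
The options above could be wired up with argparse roughly like this. It is only a sketch of the proposed interface: the program name "tool" is the placeholder used above, and the dest names and the selected_checks helper are made up for illustration.

```python
import argparse

DEFAULT_PUBLISHED = "/var/lib/pulp/published"


def build_parser():
    """Build a parser mirroring the proposed -s/-b/-m/-a/-p/-o options."""
    parser = argparse.ArgumentParser(
        prog="tool", description="Report on the integrity of /var/lib/pulp")
    parser.add_argument("-s", dest="stored", action="store_true",
                        help="validate stored content files (size/checksum)")
    parser.add_argument("-b", dest="symlinks", action="store_true",
                        help="validate symlinks (find broken)")
    parser.add_argument("-m", dest="metadata", action="store_true",
                        help="validate published metadata references")
    parser.add_argument("-a", dest="all", action="store_true",
                        help="validate all")
    parser.add_argument("-p", dest="published", default=DEFAULT_PUBLISHED,
                        help="restrict publishing validation to a directory")
    parser.add_argument("-o", dest="output", help="path to generated report")
    return parser


def selected_checks(args):
    """Return the set of check names requested; -a selects every check."""
    names = ("stored", "symlinks", "metadata")
    if args.all:
        return set(names)
    return {name for name in names if getattr(args, name)}
```

For example, build_parser().parse_args(["-b", "-o", "report.json"]) selects only the symlink check and keeps the default published directory.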

Some of the validation will require the tool to have content-type-specific knowledge, so we need to determine which content types need to be supported. RPM has been specifically requested, so let's start with that. The tool needs to be designed so that validation for additional content types can be added as requested.

The tool should gather information about stale publishes and possibly use it to qualify broken symlinks to reduce/eliminate false positives. A publish is stale when the repository was synced after its last publish.

The report should have a heading for each test, followed by a summary and a list of errors.
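
As a rough illustration of that layout, a plain-text renderer might look like the sketch below. The function name render_report and the shape of its input (test name mapped to a count of checked items plus a list of error strings) are assumptions made for this example only.

```python
def render_report(results):
    """Render {test name: (items checked, [errors])} as plain text.

    Each test gets a heading, a one-line summary, and its list of errors.
    """
    lines = []
    for name, (checked, errors) in results.items():
        lines.append(name)
        lines.append("=" * len(name))
        lines.append("checked: %d, errors: %d" % (checked, len(errors)))
        for error in errors:
            lines.append("  " + error)
        lines.append("")  # blank line between sections
    return "\n".join(lines)
```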

#2 Updated by 5 months ago

Requirement: if this is going to be a CLI tool, other tools will need to machine-parse the report, so there must be an option to output JSON or CSV at the least. JSON is preferable.

#3 Updated by bmbouter 4 months ago

I think we should determine the expected data output and get the user to ack that it meets their needs before starting on the work. +1 to using JSON. Here is a half-proposal and a question:

{
  "schema": 1,
  "corrupted_content": ["/path/to/corrupted/file1", "/path/to/corrupted/file2"],
  "broken_symlinks": ["/path/to/broken/symlink1", "/path/to/broken/symlink2"]
}

schema is the version of this report format.
corrupted_content is a list of paths to corrupted content.
broken_symlinks is a list of paths to broken symlinks.
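
For illustration, producing a report with these three keys could look something like the following. This is a sketch only; build_report and write_report are hypothetical names, and the format is still a proposal at this point.

```python
import json

SCHEMA_VERSION = 1


def build_report(corrupted_content, broken_symlinks):
    """Assemble the proposed report structure from two lists of paths."""
    return {
        "schema": SCHEMA_VERSION,
        "corrupted_content": sorted(corrupted_content),
        "broken_symlinks": sorted(broken_symlinks),
    }


def write_report(report, path):
    """Write the report as indented JSON so other tools can parse it."""
    with open(path, "w") as handle:
        json.dump(report, handle, indent=2, sort_keys=True)
```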

I'm not sure what the output format should be for the "published metadata does not reference valid symlinks" part of the report. Can someone write an example of what a failure like this would look like? Even a written example (not a JSON example) would be good.

#4 Updated by 2 months ago

+1 to the above report.

Perhaps for the metadata corruption we could just list the path to each set of metadata that could be considered 'invalid' where it references things that don't exist or are corrupt.

"invalid_metadata": ["/path/to/repodata/dir1", "/path/to/repodata/dir2"]

#5 Updated by bmbouter 2 months ago

Thanks for the info @mmccune. For invalid_metadata, is that feature intended to inspect the published metadata and the files in the published repo and make sure all metadata entries refer to files that are present? For example, for RPM, that would be the files in the repodata directory, right?

#6 Updated by 2 months ago

@bmbouter yes, exactly that. Verify that the metadata matches the set of files in the published repository, to ensure that none are missing, corrupt, or mismatched in size/checksum.
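
The per-file part of that check (present, right size, right checksum) might be sketched as below. The function name verify_file and its return format are assumptions for illustration; where the expected size and checksum come from (DB or metadata) is left open.

```python
import hashlib
import os


def verify_file(path, expected_size=None, expected_checksum=None,
                checksum_type="sha256"):
    """Return a list of problems found for one file (empty list means OK)."""
    if not os.path.isfile(path):
        return ["missing"]
    problems = []
    if expected_size is not None and os.path.getsize(path) != expected_size:
        problems.append("size mismatch")
    if expected_checksum is not None:
        digest = hashlib.new(checksum_type)
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large RPMs are not loaded into memory.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_checksum:
            problems.append("checksum mismatch")
    return problems
```

Size and checksum are only compared when an expected value is supplied, which matches the "when known" wording of the proposed -s option.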

#7 Updated by bmbouter 2 months ago

Which content types would this be for? Unlike the other features of this report which can be done generically, this metadata checking has to be implemented differently for each content type. That is ok, but each one will take additional effort. Which types do you need this for?

#8 Updated by rchan 2 months ago

There was earlier mention of RPM only. Can we confirm and clarify that this is the only content type planned to be supported with this issue?

#9 Updated by about 2 months ago

RPM only is completely acceptable

#10 Updated by bmbouter about 2 months ago

I think the easiest way to have this tool inspect each published RPM repo's metadata is with the Python bindings of createrepo_c.

For example, you can load the RepoMetadata by path with xml_parse_repomd. Then when it returns Python objects representing the metadata, those can be checked against the filesystem for correctness.
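
To show the shape of such a check without depending on createrepo_c, here is a sketch that swaps in the standard library's ElementTree to read repodata/repomd.xml and report referenced files that are absent. It is not the createrepo_c API, it only covers the files listed in repomd.xml (not the per-package entries in primary.xml), and the function name missing_repomd_files is hypothetical.

```python
import os
import xml.etree.ElementTree as ET

# XML namespace used by yum repomd.xml files.
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}


def missing_repomd_files(repo_dir):
    """Return hrefs listed in repodata/repomd.xml that are absent on disk."""
    repomd_path = os.path.join(repo_dir, "repodata", "repomd.xml")
    root = ET.parse(repomd_path).getroot()
    missing = []
    for data in root.findall("repo:data", REPO_NS):
        location = data.find("repo:location", REPO_NS)
        href = location.get("href") if location is not None else None
        if href is None:
            continue  # nothing to check for this record
        # exists() follows symlinks, so a broken link counts as missing.
        if not os.path.exists(os.path.join(repo_dir, href)):
            missing.append(href)
    return missing
```

A repository directory for which this returns a non-empty list would be a candidate for the invalid_metadata list discussed above.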

#11 Updated by about 1 month ago

  • Groomed changed from No to Yes

#12 Updated by ttereshc about 1 month ago

  • Sprint Candidate changed from No to Yes

#13 Updated by ttereshc about 1 month ago

Does it make sense to include in the report information about RPMs that are on the filesystem but no longer exist in the DB?
This would cover manual changes, either in the DB (an orphaned unit was removed directly from the DB) or on the filesystem (an RPM was copied to /var/lib/pulp/content).
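
If that were included, the comparison itself could be as simple as the sketch below. The name files_not_in_db is hypothetical, and known_paths stands in for whatever list of storage paths a DB query would return; how to obtain it is not specified here.

```python
import os


def files_not_in_db(content_root, known_paths):
    """Return files under content_root that are absent from known_paths.

    known_paths is assumed to be an iterable of storage paths taken
    from the DB.
    """
    known = {os.path.normpath(path) for path in known_paths}
    extra = []
    for dirpath, _dirnames, filenames in os.walk(content_root):
        for name in filenames:
            path = os.path.normpath(os.path.join(dirpath, name))
            if path not in known:
                extra.append(path)
    return sorted(extra)
```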

#16 Updated by rchan 29 days ago

  • Sprint/Milestone set to 56

#17 Updated by bmbouter 9 days ago

  • Sprint set to Sprint 33

#18 Updated by bmbouter 9 days ago

  • Sprint/Milestone deleted (56)

#19 Updated by 3 days ago

  • Sprint deleted (Sprint 33)
