Story #2619

Need a file-system integrity report for /var/lib/pulp

Added by jortel@redhat.com about 1 year ago. Updated 3 days ago.

Status:
POST
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
% Done:

0%

Platform Release:
Blocks Release:
Backwards Incompatible:
No
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:
Sprint 37

Description

Due to historical issues with Satellite, or ongoing problems during content manipulation, there are situations where various inconsistencies can exist among the files contained within /var/lib/pulp.

We have customers who are going to start deploying the new 'repair' facilities in this feature we are adding:

https://bugzilla.redhat.com/show_bug.cgi?id=1223023

  • [RFE] Allow Pulp to verify/repair corrupted packages in a repository

With the addition of the repair side of this feature, we need a way to identify the following conditions:

  • Missing RPMs in /var/lib/pulp/content
  • Corrupt or non-matching checksums (md5sums) on any unit in /var/lib/pulp/content
  • Invalid repositories within /var/lib/pulp/published where the yum metadata points at symlinks that are missing
  • Missing or broken symlinks for published repositories for Content Views

E.g.:

Source: /var/lib/pulp/published/yum/master/yum_distributor/Default_Organization-Library-rhel7-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_RPMs_x86_64_7Server/1472243944.1

Target: /var/lib/pulp/published/yum/https/repos/Default_Organization/Library/rhel7/content/dist/rhel/server/7/7Server/x86_64/os

We may add more criteria to check, but in order to restore confidence in the integrity of /var/lib/pulp, we need to be able to report on the state of this sub-directory.

Runtimes to generate this report are expected to be very long, but this should not be a blocker for the implementation.

History

#1 Updated by jortel@redhat.com 7 months ago

I imagine this would be a stand-alone tool (script) that runs on each satellite/capsule and writes a report. The tool should display progress when possible and write a file containing the report.

Like:

$ tool -h

  -s  validate that stored content files exist and match size/checksum when known.
  -b  validate symlinks (find broken ones).
  -m  validate that published metadata references valid symlinks.
  -a  validate all.

  -p  restrict publishing validation to a specific directory.  default: /var/lib/pulp/published
  -o  path to the generated report.

Some of the validation will require the tool to have content type specific knowledge so we need to determine which content types need to be supported. RPM has been specifically requested so let's start with that. The tool needs to be designed to support adding validation for additional content types as requested.

The tool should gather information about stale publishes and possibly qualify broken symlinks to reduce/eliminate false positives. A publish is stale when the repository has been sync'd after the last publish.

The report should have a heading for each test, followed by a summary and a list of errors.
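The option listing above could be sketched as an argparse skeleton. This is a hypothetical skeleton only (the tool doesn't exist yet); the option names and defaults are taken from the listing above:

```python
import argparse


def build_parser():
    """Build the CLI proposed above (a sketch, not the shipped tool)."""
    parser = argparse.ArgumentParser(
        description='Report on the integrity of /var/lib/pulp.')
    parser.add_argument('-s', action='store_true', dest='stored',
                        help='validate stored content files exist and match '
                             'size/checksum when known')
    parser.add_argument('-b', action='store_true', dest='broken',
                        help='validate symlinks (find broken ones)')
    parser.add_argument('-m', action='store_true', dest='metadata',
                        help='validate that published metadata references '
                             'valid symlinks')
    parser.add_argument('-a', action='store_true', dest='all',
                        help='validate all')
    parser.add_argument('-p', dest='published',
                        default='/var/lib/pulp/published',
                        help='restrict publishing validation to a directory')
    parser.add_argument('-o', dest='output',
                        help='path to the generated report')
    return parser
```

Running `tool -a -o /tmp/report.json` with such a parser would select all checks and set the report path; unset checks default to False.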

#2 Updated by mmccune@redhat.com 7 months ago

Requirement: if we are going to be using a CLI tool, we will need to machine-parse the output of the report in other tools, so it would be required to have an option to output at least JSON or CSV. JSON preferable.

#3 Updated by bmbouter 6 months ago

I think we should determine the expected data output and get the user to ack that it meets their needs before starting on the work. +1 to using json. Here is a half-proposal and a question:

{
"schema": 1,
"corrupted_content": ["/path/to/corrupted/file1", "/path/to/corrupted/file1"],
"broken_symlinks": ["/path/to/broken/symlink1", "/path/to/broken/symlink2"]
}

The schema field defines the schema version of this report.
The corrupted_content field is a list of paths to corrupted content.
The broken_symlinks field is a list of broken symlinks.

I'm not sure what the output format should be for the "published metadata does not reference valid symlinks" part of the report. Can someone write an example of what a failure like this would be? Even a written example (not a JSON example) would be good.

#4 Updated by mmccune@redhat.com 5 months ago

+1 to the above report.

Perhaps for the metadata corruption we could just list the path to each set of metadata that could be considered 'invalid' where it references things that don't exist or are corrupt.

..
"invalid_metadata": ["/path/to/repodata/dir1", "/path/to/repodata/dir2"]
..

#5 Updated by bmbouter 4 months ago

Thanks for the info @mmccune. For the invalid_metadata: is that feature intended to inspect the published metadata and the files in the published repo and make sure all metadata entries refer to files that are present? For example, for RPM, that would be the files in the repodata directory, right?

#6 Updated by mmccune@redhat.com 4 months ago

@bmbouter yes, exactly that. Verify that the metadata matches the set of files in the published repository to ensure none are missing, corrupt, or mismatched in size/checksum.

#7 Updated by bmbouter 4 months ago

Which content types would this be for? Unlike the other features of this report which can be done generically, this metadata checking has to be implemented differently for each content type. That is ok, but each one will take additional effort. Which types do you need this for?

#8 Updated by rchan 4 months ago

There was earlier mention of RPM only. Can we confirm and clarify that this is the only content type planned to be supported with this issue?

#9 Updated by mmccune@redhat.com 4 months ago

RPM only is completely acceptable

#10 Updated by bmbouter 4 months ago

I think the easiest way to have this tool inspect each published RPM repo's metadata is with the Python bindings of createrepo_c.

For example, you can load the repo metadata by path with xml_parse_repomd. When it returns Python objects representing the metadata, those can be checked against the filesystem for correctness.
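The same repomd check can be sketched with only the standard library; this is a hedged stand-in for the createrepo_c approach (the real tool would likely use xml_parse_repomd as suggested), verifying that every file referenced by repomd.xml exists and matches its recorded checksum:

```python
import hashlib
import os
import xml.etree.ElementTree as ET

# XML namespace used by repomd.xml
REPO_NS = '{http://linux.duke.edu/metadata/repo}'


def check_repomd(repo_dir):
    """Verify files referenced by repodata/repomd.xml in repo_dir.

    Returns a list of error strings; an empty list means the
    published metadata is consistent with the filesystem.
    """
    errors = []
    repomd = os.path.join(repo_dir, 'repodata', 'repomd.xml')
    root = ET.parse(repomd).getroot()
    for data in root.findall(REPO_NS + 'data'):
        href = data.find(REPO_NS + 'location').get('href')
        checksum = data.find(REPO_NS + 'checksum')
        path = os.path.join(repo_dir, href)
        if not os.path.exists(path):
            errors.append('%s: missing %s' % (repomd, href))
            continue
        # Recompute the checksum with the algorithm recorded in repomd.xml
        digest = hashlib.new(checksum.get('type'))
        with open(path, 'rb') as fp:
            digest.update(fp.read())
        if digest.hexdigest() != checksum.text:
            errors.append('%s: checksum mismatch for %s' % (repomd, href))
    return errors
```

This only covers the repomd level; createrepo_c would additionally parse primary.xml and let the tool verify each package entry.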

#11 Updated by jortel@redhat.com 3 months ago

  • Groomed changed from No to Yes

#12 Updated by ttereshc 3 months ago

  • Sprint Candidate changed from No to Yes

#13 Updated by ttereshc 3 months ago

Does it make sense to include in the report information about RPMs which are on the filesystem but no longer exist in the DB?
This covers the case of manual changes: either in the DB (an orphaned unit was removed directly from the DB) or on the filesystem (an RPM was copied into /var/lib/pulp/content).
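This kind of check amounts to diffing the filesystem against the set of storage paths the database knows about. A minimal sketch (the function name is hypothetical; how known_paths is obtained, e.g. via a Mongo query, is out of scope here):

```python
import os


def find_dark_content(content_root, known_paths):
    """Report files under content_root that the DB knows nothing about.

    known_paths is the set of storage paths recorded in the
    database; anything on disk outside that set is "dark" content.
    """
    dark = []
    for dirpath, _dirnames, filenames in os.walk(content_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path not in known_paths:
                dark.append(path)
    return sorted(dark)
```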

#16 Updated by rchan 3 months ago

  • Sprint/Milestone set to 56

#17 Updated by bmbouter 3 months ago

  • Sprint set to Sprint 33

#18 Updated by bmbouter 3 months ago

  • Sprint/Milestone deleted (56)

#19 Updated by jortel@redhat.com 2 months ago

  • Sprint deleted (Sprint 33)

#20 Updated by daviddavis@redhat.com about 2 months ago

  • Sprint set to Sprint 35

#21 Updated by ttereshc about 2 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to milan

#22 Updated by milan about 1 month ago

Performing some initial investigation...

It seems the easiest way to get all the plug-in unit models, so that some common attributes can be validated, is to use the PluginManager:

from pulp.plugins.loader import manager

pm = manager.PluginManager()
pm.unit_models == {'distribution': pulp_rpm.plugins.db.models.Distribution,
 'drpm': pulp_rpm.plugins.db.models.DRPM,
 'erratum': pulp_rpm.plugins.db.models.Errata,
 'iso': pulp_rpm.plugins.db.models.ISO,
 'package_category': pulp_rpm.plugins.db.models.PackageCategory,
 'package_environment': pulp_rpm.plugins.db.models.PackageEnvironment,
 'package_group': pulp_rpm.plugins.db.models.PackageGroup,
 'package_langpacks': pulp_rpm.plugins.db.models.PackageLangpacks,
 'rpm': pulp_rpm.plugins.db.models.RPM,
 'srpm': pulp_rpm.plugins.db.models.SRPM,
 'yum_repo_metadata_file': pulp_rpm.plugins.db.models.YumMetadataFile} 

I'd like to introduce a pulp.validators entry point that core would populate with e.g. the symlink validation, and that plug-ins could customize.
The utility would map the validators over the content units.

A validator would have an applicable callable to check its applicability to a particular content unit type.
Should the default behavior be to register a validator against the unit types of its own module?

A unit can fail multiple validations; the record of failures should reflect that:


from collections import namedtuple

ValidationFailure = namedtuple('ValidationFailure', ('validator', 'unit', 'error'))
ValidationSuccess = namedtuple('ValidationSuccess', ('validator', 'unit'))


class ValidationError(Exception):
    """A generic validation error."""


class Validator(object):
    # explicitly applicable to everything by default
    applicable = classmethod(lambda cls, unused: True)

    def __call__(self, unit):
        """The interface of a validator.

        Returns a validation result, or None if the validator isn't
        applicable to the content unit.
        """
        if not self.applicable(unit):
            return None

        try:
            self.validate(unit)
        except ValidationError as exc:
            return ValidationFailure(self, unit, exc)
        else:
            return ValidationSuccess(self, unit)

    def validate(self, unused):
        """Validate a content unit.

        To be overridden by a subclass.
        Raise a ValidationError instance to declare a validation failure.
        """
        pass

Some generic ChecksumValidator or SymlinkValidator would be provided by the core utility part.
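The core of such a ChecksumValidator might be a check like the following. This is a self-contained sketch for illustration only: ValidationError is redefined here, and the unit is assumed to expose storage_path, checksum, and checksumtype attributes (hypothetical names, not confirmed against the actual Pulp models):

```python
import hashlib


class ValidationError(Exception):
    """Raised when a unit fails validation."""


def validate_checksum(unit):
    """Recompute a unit's file checksum and compare to the recorded one.

    Assumes the unit exposes storage_path, checksum and checksumtype
    attributes (hypothetical names used for illustration).
    """
    digest = hashlib.new(unit.checksumtype)
    with open(unit.storage_path, 'rb') as fp:
        # Read in 1 MiB chunks so large RPMs don't load into memory whole
        for chunk in iter(lambda: fp.read(1 << 20), b''):
            digest.update(chunk)
    if digest.hexdigest() != unit.checksum:
        raise ValidationError(
            'Unit %s, storage path %s has an invalid checksum.'
            % (unit, unit.storage_path))
```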

Publication validation is going to be plug-in specific, therefore plug-ins might want to provide publication validators as entry points through pulp.validators; for instance, an RPM publish check might reuse some plug-in importer code (e.g. for parsing metadata).
A fix to this issue should include such a validator implementation in the pulp_rpm plug-in.

#23 Updated by bmbouter about 1 month ago

I don't think the goal of this ticket is to "get all plug-in unit models to validate". The tool request is laser-focused on RPM support (comment 9). Introducing general mechanisms to extend this report isn't part of the desire and would (I think) introduce a risk of Pulp2 regressions.

Did you see this comment that talks about using createrepo_c? https://pulp.plan.io/issues/2619?pn=1#note-10 Do you see how that could be used to make this tool?

#24 Updated by bmbouter about 1 month ago

My comment 23 suggested createrepo_c be used for 100% of the features of this ticket, but that is not right. I meant for it to be used only for the invalid_metadata section. Comment 5 talks more about how to do that technically.

I also misunderstood the broken_symlinks requirement. I thought it literally meant broken symlinks, but @ttereshc pointed out that lazy downloaded content can have valid symlinks which point to nowhere. @ttereshc also convinced me that in order to know the difference between a "broken symlink" and one that is "a symlink to not-yet-delivered lazy content" we'll need to query data from the Pulp database.
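The distinction between a genuinely broken symlink and a link to not-yet-delivered lazy content could be sketched like this (hypothetical names; downloaded_paths stands in for the Pulp DB / lazy download catalog query the comment describes):

```python
import os


def classify_symlink(link_path, downloaded_paths):
    """Classify a published symlink.

    downloaded_paths is the set of storage paths the download catalog
    says were actually delivered. Returns 'ok', 'lazy' (target missing
    but never downloaded, so expected), or 'broken' (target missing
    although the catalog says it was downloaded).
    """
    target = os.path.realpath(link_path)
    if os.path.exists(target):
        return 'ok'
    # Target is missing: only the DB lookup can tell a genuinely broken
    # link from a link to not-yet-delivered lazy content.
    if target not in downloaded_paths:
        return 'lazy'
    return 'broken'
```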

#25 Updated by milan about 1 month ago

So this isn't really posted yet, but I've got some success to report with a patch that still needs to be PR-ed:

(pulp) [vagrant@pulp2 integrity]$ sudo -u apache python integrity.py

{
 "unit_failures": [
  {
   "validator": "ChecksumValidator",
   "unit": "rpm: lemon-0-0-1-noarch-sha256-5fff03684c9e503bed5cfdddb327318ad4e2175073811d9d97fcadc3113fbfec",
   "repository": "rich-deps-sync",
   "error": "Unit rpm: lemon-0-0-1-noarch-sha256-5fff03684c9e503bed5cfdddb327318ad4e2175073811d9d97fcadc3113fbfec, storage path /var/lib/pulp/content/units/rpm/2b/e42bb9d8d3a0eda1dff939766f4ff5e67244fe0536a1fc8f842c088df7edb4/7adf3b78-acc9-4fed-ac1e-c5855fdb83e9 has an invalid checksum." 
  },
  {
   "validator": "ChecksumValidator",
   "unit": "rpm: lemon-0-0-1-noarch-sha256-5fff03684c9e503bed5cfdddb327318ad4e2175073811d9d97fcadc3113fbfec",
   "repository": "rich-deps-copy",
   "error": "Unit rpm: lemon-0-0-1-noarch-sha256-5fff03684c9e503bed5cfdddb327318ad4e2175073811d9d97fcadc3113fbfec, storage path /var/lib/pulp/content/units/rpm/2b/e42bb9d8d3a0eda1dff939766f4ff5e67244fe0536a1fc8f842c088df7edb4/7adf3b78-acc9-4fed-ac1e-c5855fdb83e9 has an invalid checksum." 
  },
  {
   "validator": "ChecksumValidator",
   "unit": "rpm: lemon-0-0-1-noarch-sha256-5fff03684c9e503bed5cfdddb327318ad4e2175073811d9d97fcadc3113fbfec",
   "repository": "rich-deps",
   "error": "Unit rpm: lemon-0-0-1-noarch-sha256-5fff03684c9e503bed5cfdddb327318ad4e2175073811d9d97fcadc3113fbfec, storage path /var/lib/pulp/content/units/rpm/2b/e42bb9d8d3a0eda1dff939766f4ff5e67244fe0536a1fc8f842c088df7edb4/7adf3b78-acc9-4fed-ac1e-c5855fdb83e9 has an invalid checksum." 
  },
  {
   "validator": "BrokenSymlinksValidator",
   "unit": "rpm: tablespoon-sugar-0-1-0-noarch-sha256-5591731a1f07b7f1bd7be3777876f5f43e91611619ec62c41fa1d4a038152cc6",
   "repository": "rich-deps",
   "error": "Unit rpm: tablespoon-sugar-0-1-0-noarch-sha256-5591731a1f07b7f1bd7be3777876f5f43e91611619ec62c41fa1d4a038152cc6, has a broken symlink /var/lib/pulp/published/yum/https/repos/rich-deps/Packages/t/tablespoon-sugar-1-0.noarch.rpm in repo rich-deps." 
  }
 ],
 "broken_repos": [
  {
   "validator": "AffectedRepositoriesValidator",
   "repository": "rich-deps" 
  },
  {
   "validator": "AffectedRepositoriesValidator",
   "repository": "rich-deps-copy" 
  },
  {
   "validator": "AffectedRepositoriesValidator",
   "repository": "rich-deps-sync" 
  }
 ],
 "dark_matter": [
  {
   "validator": "DarkContentValidator",
   "path": "/var/lib/pulp/content/units/rpm/2b/e42bb9d8d3a0eda1dff939766f4ff5e67244fe0536a1fc8f842c088df7edb4/deadbeef-dead-beefdead" 
  }
 ]
}

#26 Updated by rchan about 1 month ago

  • Sprint changed from Sprint 35 to Sprint 36

#27 Updated by milan about 1 month ago

  • Status changed from ASSIGNED to POST

#28 Updated by milan about 1 month ago

I'd like to suggest a change of the output format; the current one requires holding references to all detected issues just to dump them at the end of execution, which wastes memory. Also, the flexibility of queries into the current nested structure is quite limited. Instead, I'd like to propose a flat(ter) one:

{
    "report": [
        {
            "error": "Invalid Size.....",
            "repository": "Zoo",
            "unit": "rpm: Lion.....",
            "validator": "SizeValidator(version=0.1)"
        },
        {
            "error": "....",
            "repository": "....",
            "unit": "....",
            "validator": "...."
        }
    ]
}

This way folks can query the report later on, even over saved results.
For instance, to obtain the list of affected repositories, one could do:

sudo -u apache pulp-integrity --check size --check broken_symlinks | jq '[.report[].repository] | unique'

[
  "rich-deps",
  "rich-deps-sync" 
]

The attached PR can be used to test this new proposal.

#29 Updated by milan 30 days ago

Added a pulp_rpm specific plug-in https://github.com/pulp/pulp_rpm/pull/1104

#30 Updated by dkliban@redhat.com 22 days ago

Upon further discussion of the problem I believe this tool needs to use a SQLite database to store information about content, symlinks, repositories, and repo metadata. The first step of using the tool would be to generate a DB with all the information about the filesystem. The second step is generating a report from the information in the DB.

The first step needs to:

  • find all content in /var/lib/pulp/content/rpm and store it in a DB table
    - calculate checksums
  • find all the symlinks in /var/lib/pulp/published/yum/https/ and store them in another table
    - determine whether the symlink is broken and, if so, whether that is expected because the download catalog says the content was never downloaded
    - if not broken, check whether the checksum stored in the Pulp database matches the checksum calculated for the content the link points to
  • find all repositories published in /var/lib/pulp/published/yum/https/ and store them in another table
    - determine whether the repo metadata references any broken symlinks found in the previous step

The second step should generate a report from the SQLite database.
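The two-step approach implies a scratch schema roughly like the following. Table and column names here are hypothetical, not from any actual implementation:

```python
import sqlite3

# Hypothetical schema for the step-one scan; step two would build the
# report purely by querying these tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS content (
    path TEXT PRIMARY KEY,
    checksum TEXT,            -- NULL until calculated, enabling resume
    checksum_type TEXT,
    size INTEGER
);
CREATE TABLE IF NOT EXISTS symlink (
    path TEXT PRIMARY KEY,
    target TEXT,
    broken INTEGER,           -- 1 if the target is missing
    expected_broken INTEGER   -- 1 if the download catalog explains it
);
CREATE TABLE IF NOT EXISTS repository (
    path TEXT PRIMARY KEY,
    references_broken INTEGER -- 1 if repo metadata references broken links
);
"""


def open_db(path):
    """Create or reopen the scratch database."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db
```

A report query for step two could then be, e.g., `SELECT path FROM symlink WHERE broken = 1 AND expected_broken = 0`.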

#31 Updated by bmbouter 22 days ago

Since the tool runs and computes checksums it will run for a looooong time. Having a sqlitedb would allow it to "resume" which for the runtime of the tool is a required feature.

#32 Updated by milan 22 days ago

I disagree with such a suggestion, for the following reasons:

  • conceptual issue: it creates another source of inconsistency by duplicating data from the disk and from Mongo in the SQLite DB file, especially if the trait collection gets interrupted
  • usage difficulty:
    • checks through SQL queries: you have to debug your query before you can trust it, versus delegating that trust to Pulp engineering to assemble the check you need on your behalf
    • exposing the query language through the tool CLI
  • inefficiency: it requires gathering every trait before being able to return any results
  • integration would require creating a library of predefined SQL expressions for issue queries, to keep the report results consistent across users (e.g. a BrokenSymlink model)

The value I see is in the generic, pure, and expressive SQL rather than in a domain-specific language (on the CLI).

#33 Updated by dkliban@redhat.com 21 days ago

The solution I outlined in comment 30 does not imply that the user would need to write SQL. The author of this tool needs to write all the SQL queries needed to build the report. The tool just needs to provide the user with an option to print the report to the screen.

#34 Updated by milan 21 days ago

In that case I don't see any value added.

#35 Updated by dkliban@redhat.com 21 days ago

Since this tool will be running for a long time, it needs to support interruption and resuming. The database will allow adding such a feature.

#36 Updated by milan 21 days ago

Please, see the first bullet in Comment 32.

#37 Updated by dkliban@redhat.com 21 days ago

If the tool is being run while Pulp is still fully functional, the inconsistencies you are referring to are probably going to occur with both approaches. Any symlinks published while the tool is running will likely be missed by the tool.

Using a database will allow the user to re-run the tool a second time. The second time, the tool can check its database to see whether it had already encountered the content in the previous run. If it had, there is no need to calculate the checksum again. When the tool looks at symlinks the second time, it can use the database to determine whether it had already examined a given symlink. If it had already recorded all the information about it, the tool can move on to the next symlink without checking it. On the second run, the tool will also be able to find any new symlinks that got added since the first run.

#38 Updated by milan 21 days ago

wrote:

Using a database will allow the user to re-run the tool a second time.
The second time the tool can check in its database to see if it had already encountered the content in the previous run. If it had, there is no need to calculate the checksum again.
When the tool looks at symlinks for the second time it can use the database to determine if it had already examined that symlink or not.
If it had already recorded all the information about it, then the tool can move on to the next symlink without checking it.
On the second run, the tool will be able to find any new symlinks that got added since the first run.

What if you encounter an SQLite record whose file has meanwhile disappeared from the disk?
What if an SQLite record got purged from both MongoDB and the disk in the meantime?

Having been interrupted, the tool has to re-evaluate its SQLite database against the disk and MongoDB to give sound results, doesn't it?
So there's no benefit in continuing its run compared to a from-scratch run, is there?

If the tool is being run while Pulp is still fully functional, the inconsistencies you are referring to are probably going to occur with both approaches. Any symlinks published while the tool is running will likely be missed by the tool.

Yes, of course. I believe I can achieve feature parity here by recording the last-checked unit UUID.

#39 Updated by jortel@redhat.com 14 days ago

I think there is value in collecting information about content and symlinks in an SQLite DB. Both collecting the symlink information in deployments using NFS and calculating checksums will be expensive. Doing this once and keeping track of what has been collected (or calculated) is the most practical approach, especially if restarting the tool is a requirement. The tool should provide an option to purge cached data for cases where a significant amount of pulp activity has happened between runs of the tool. The concern re: anomalies between the cached (in the SQLite DB) data and pulp/filesystem is understandable, but unavoidable given that the tool is working against a live pulp.

#40 Updated by milan 13 days ago

wrote:

I think there is value in collecting information about content and symlinks in a sqlite DB. Both collecting the symlink information in deployments using NFS and calculating checksums will be expensive.

The proposed DB-based approach requires all the traits to be stored before any results can be inferred, so it will require much more resources/time, especially in the NFS case.
For instance, to be able to infer the absence of a symlink, one would have to store presence values for both possible locations (the old way and the new way) in the DB.
Moreover, what's the point of calculating the checksum of a file with an invalid size, and storing these traits in the DB, before inferring a report in the second step?

Doing this once and keeping track of what has been collected (or calculated) is the most practical approach especially if restarting the tools is requirement.

Please, could you elaborate on what the "keeping track" would look like exactly?

The tool should provide an option to purge cached data for cases where a significant amount of pulp activity has happened between runs of the tool.

I wonder how exactly one would define "significant amount of pulp activity" to objectively state when to purge.

The concern re: anomalies between the cached (in sqlite DB) data and pulp/filesystem is understandable but unavoidable given that the tool is working against a live pulp.

Yes. Both approaches have the same drawback: the moment you run a tool based on either approach, the result it generates is outdated. Hence the DB approach is no better than my current approach in this regard, so I see no value added with regard to being able to continue the computation after a break; I can just as well remember the last Mongo unit visited (paginate).

#41 Updated by rchan 13 days ago

  • Sprint changed from Sprint 36 to Sprint 37

#42 Updated by jortel@redhat.com 10 days ago

milan wrote:

wrote:
The proposed DB-based approach requires all the traits to be stored before being able to infer results, so it will require much more resources/time esp. in the NFS case.

Agreed. Storing the information in the DB will take more resources. Why especially with NFS?

For instance, to be able to infer absence of a symlink, one would have to store both possible locations option presence values (old way and new way) in the DB.

Agreed.

Moreover, what's the point of calculating checksum of a file with an invalid size and store these traits in the DB before inferring a report in the second step?

The point of storing the checksum in the DB is so it's not re-calculated if the tool is restarted.

The tool should provide an option to purge cached data for cases where a significant amount of pulp activity has happened between runs of the tool.

I wonder how exactly would one define "significant amount of pulp activity" to objectively state when to purge.

I imagined this would be solely at the discretion of the Admin. I don't see how it could be determined.

#43 Updated by jortel@redhat.com 10 days ago

The requirement for the tool to be resumable is the main driver for me to support caching in SQLite. It's my belief that by caching the data, we can limit the number of hits to an NFS mount and the number of times the tool calculates a content checksum. Is there a way to achieve these efficiencies without caching that I'm missing?

#44 Updated by milan 10 days ago

wrote:

milan wrote:

wrote:
The proposed DB-based approach requires all the traits to be stored before being able to infer results, so it will require much more resources/time esp. in the NFS case.

Agreed. Storing the information in the DB will take more resources. Why especially with NFS?

Because any redundant work will make the tool take longer to finish with NFS than with a "local" filesystem.

For instance, to be able to infer absence of a symlink, one would have to store both possible locations option presence values (old way and new way) in the DB.

Agreed.

Moreover, what's the point of calculating checksum of a file with an invalid size and store these traits in the DB before inferring a report in the second step?

The point of storing the checksum in the DB is so it's not re-calculated if the tool is restarted.

How exactly would one be able to tell whether a unit checksum is missing from the DB after a restart? How about units removed or added between the stop and the restart of the tool?
How would orphaned units be excluded from the checksum calculation? And how would the "dark" content be?

The tool should provide an option to purge cached data for cases where a significant amount of pulp activity has happened between runs of the tool.

I wonder how exactly would one define "significant amount of pulp activity" to objectively state when to purge.

I imagined this would be solely at the discretion of the Admin. I don't see how it could be determined.

Agreed.

#45 Updated by milan 10 days ago

wrote:

The requirement for the tool to be resumable is the main driver for me to support caching in sqllite. It's my belief that by caching the data, we can limit the number of hits to an NFS mount and the number of times the tool would calculate content checksum. Is there a way to achieve these efficiencies without caching that I'm missing?

My current approach is to enumerate content units to avoid redundant disk hits; resuming can be achieved by adopting pagination and handling SIGTERM:

$ pulp-integrity --last-unit <foo-id> --check size --check checksum

I'd say disk caching is best left to the particular filesystem and driver, though I agree that batch-accessing the various filesystem content unit (symlink) traits might prove more efficient than random access, as the various system readahead buffers might get flushed more often with the latter approach.
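The resume-by-pagination idea can be sketched as follows (a hypothetical shape, not the actual patch: units is assumed to be an id-ordered iterable, e.g. a paginated Mongo query, and the returned id would be fed back via a flag like --last-unit on the next run):

```python
import signal


class GracefulStop(object):
    """Turn SIGTERM into a flag checked between units, so the tool can
    record the last unit it finished and resume from there later."""

    def __init__(self):
        self.stopped = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.stopped = True


def run_checks(units, check, last_unit_id=None):
    """Apply check() to units in a stable order, skipping past last_unit_id.

    Returns the id of the last unit processed, which the caller can
    persist and pass back on the next invocation to resume.
    """
    stop = GracefulStop()
    skipping = last_unit_id is not None
    done = None
    for unit in units:  # assumed sorted by id
        if skipping:
            if unit['_id'] == last_unit_id:
                skipping = False  # everything up to here was already done
            continue
        check(unit)
        done = unit['_id']
        if stop.stopped:
            break
    return done
```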

#46 Updated by milan 3 days ago

Having spoken with @ttereshc, I realized I have been ignoring one use case: the availability of the collected inspection data outside the scope of the integrity tool, which the SQLite file would provide. In this use case, an advanced user would like to draw their own conclusions based on a custom, hand-made SQL query rather than on the integrity tool report, the usability of the data being proportional to the amount of traits collected. Providing the same functionality through the current implementation would be cumbersome, as it goes in the opposite direction: reducing trait collection as much as possible.
