Story #1647

closed

Unify checksum management to the platform and add some features

Added by bmbouter about 8 years ago. Updated about 5 years ago.

Status: CLOSED - WONTFIX
Priority: High
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
% Done: 0%
Estimated time:
Platform Release:
Groomed: Yes
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

The platform should handle checksum management for both syncing and publishing. We want a few new features in the area of checksums, so moving it to the platform ensures new features will be available on all content types. A side benefit is that plugin writers won't have to do as much with checksums.

The first step is to make a plan for bringing checksum management into the platform. That will likely involve backwards-incompatible changes, so this story will likely be moved to 3.0.0 once a plan is written. Please write the plan somewhere like a public etherpad for discussion.


Related issues

Related to RPM Support - Issue #1618: --checksum-type is broken (CLOSED - CURRENTRELEASE, mhrivnak)
Actions #1

Updated by bmbouter about 8 years ago

  • Related to Issue #1619: as user, I can export repo groups with different checksum than sha256 added
Actions #2

Updated by bmbouter about 8 years ago

  • Related to Issue #1618: --checksum-type is broken added
Actions #3

Updated by Anonymous about 8 years ago

  • Sprint/Milestone set to 19
Actions #4

Updated by jcline@redhat.com almost 8 years ago

I feel like this issue is part of a larger issue, namely designing a plugin API. Furthermore, I imagine this 'unification' requires actually tracking individual files in Pulp with the metadata required to validate the integrity of the file (checksums/checksum types, size, maybe permissions/ownership, location, etc). Our current data models don't do this.

Since we don't have either of those things fleshed out, would it be reasonable to figure out what we're doing there before trying to work on this story? Or is this about having a very high-level idea of where we'd like to be?

Actions #5

Updated by jcline@redhat.com almost 8 years ago

  • Related to deleted (Issue #1619: as user, I can export repo groups with different checksum than sha256)
Actions #6

Updated by bmbouter almost 8 years ago

@jcline I agree with all of the observations you made, especially the part that requires us to know what we want out of this story.

For myself, I was thinking the latter thing you suggested ... This story is to write out "a very high-level idea of where we'd like to be".

Actions #7

Updated by jcline@redhat.com almost 8 years ago

Okay, given that, how about I outline what I'm thinking and if no one objects we can edit the description on this story.

Content Integrity in Pulp

As a user of Pulp, I would like to ensure the content I am serving to my clients is correct (that is to say, it is what I think it is). Content can become incorrect for several reasons. The most likely reason for incorrect content is probably a bug in either Pulp's code, or one of the libraries we use. However, these bugs are not the only reason content could be incorrect. Bit rot can occur, as can bugs in hard drive firmware[0]. Some file systems are capable of detecting bit rot of whole files and potentially repairing it[1][2][3], but many do not (like ext4 and XFS). Even if they could, we should not rely on a software layer below us for the integrity of the content we manage.

Things we want to be able to do:

  • Tell Pulp to check every file's integrity (a pulp-scrub if you will)
  • Tell Pulp to attempt to fix problems it finds (re-download the file or similar)
  • Tell Pulp a particular file is bad and it should retrieve it again (I'm thinking about content types that don't have checksums as part of their metadata, or to recover from bugs in the two bullet points above)
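
To make the first two bullets concrete, here is a minimal sketch of what a scrub pass could look like. It is not existing Pulp code. The `expected` mapping and the `refetch` callable are assumptions made up for illustration; in practice the recorded checksums would come from the database and the re-download from whatever the platform provides.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded):
    """True if the file exists and its digest matches the recorded one."""
    return Path(path).is_file() and sha256_of(path) == recorded


def scrub(expected, refetch=None):
    """Check every tracked file and optionally try to repair bad ones.

    `expected` maps a file path to its recorded SHA-256 hex digest
    (hypothetical input; a real implementation would read it from the db).
    `refetch` is a hypothetical callable(path) that re-downloads the file.
    Returns the paths that are still bad after any repair attempt.
    """
    still_bad = []
    for path, recorded in expected.items():
        if is_intact(path, recorded):
            continue
        if refetch is not None:
            refetch(path)
            if is_intact(path, recorded):
                continue
        still_bad.append(path)
    return still_bad
```

Calling `scrub(expected)` only reports problems, while `scrub(expected, refetch=...)` also attempts the repair. The third bullet (a user flagging a file as bad) would amount to calling the re-download step directly for that one file.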

How Our Data Model Must Change

To have any chance of providing content integrity validation and potential repair (a pulp-scrub, if you will), we must track each and every file for all content units. We don't currently do this. There are multi-file units (like Distributions, and maybe OSTree?). Information we probably want to track for each file:

  • checksum
  • checksum_type (although we might want to just stick with sha256 and not tie this to any potential metadata we know about the file)
  • size of the file
  • the origin of the file
  • the storage location of the file
  • access control settings?

It might not be a bad idea to also track the integrity of this metadata by hashing it and storing the hash with it. This should happen for every file we manage, regardless of content type. Therefore, this should live in the platform and leads us to...
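
As a rough sketch of the per-file record and the metadata hash described above (field names such as `storage_path`, `origins`, and `mode` are hypothetical and not part of any current Pulp model):

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FileRecord:
    """Hypothetical per-file record; every field name is illustrative."""

    storage_path: str                  # the storage location of the file
    checksum: str                      # hex digest of the file contents
    size: int                          # size of the file in bytes
    checksum_type: str = "sha256"
    origins: List[str] = field(default_factory=list)  # where it came from
    mode: int = 0o644                  # access control settings (assumed)

    def metadata_digest(self):
        """Hash the record itself so corrupted metadata can be detected."""
        # sort_keys gives a stable serialization regardless of field order.
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()
```

The digest would be stored next to the record and recomputed during a scrub; a mismatch means the metadata, rather than the file, has changed unexpectedly.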

Plugin API

We need to define a plugin API with this feature in mind. It could potentially happen as part of the file retrieval. The user would provide a URL and, if it's available to them, the data integrity information (think RPM's primary.xml metadata file, which provides locations and checksums). The platform would handle retrieving the files, validating that the download went smoothly (with the provided checksums) or generating the initial checksum, creating the database record for the file, etc. This is just a thought though, and worth fleshing out.
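
A minimal sketch of that retrieval step, assuming sha256 only and using the standard library's `urllib.request`. The `retrieve` function, the `ChecksumMismatch` exception, and the shape of the returned dict are all made up for illustration; none of this is an existing Pulp API.

```python
import hashlib
import os
import urllib.request


class ChecksumMismatch(Exception):
    """Raised when a download does not match the checksum the plugin gave."""


def retrieve(url, destination, expected_sha256=None, chunk_size=1024 * 1024):
    """Download `url` to `destination`, hashing the data as it streams.

    If the plugin supplied `expected_sha256`, the download is validated
    against it; otherwise the computed digest becomes the initial checksum.
    Returns a dict standing in for the database record (hypothetical shape).
    """
    digest = hashlib.sha256()
    size = 0
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        for chunk in iter(lambda: response.read(chunk_size), b""):
            digest.update(chunk)
            out.write(chunk)
            size += len(chunk)
    actual = digest.hexdigest()
    if expected_sha256 is not None and actual != expected_sha256:
        os.remove(destination)  # do not keep content we cannot trust
        raise ChecksumMismatch(f"{url}: expected {expected_sha256}, got {actual}")
    return {
        "storage_path": destination,
        "checksum": actual,
        "checksum_type": "sha256",
        "size": size,
        "origin": url,
    }
```

With something like this, a plugin would only parse its metadata and hand the platform a URL plus an optional checksum; hashing, validation, and record creation would stay in one place.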

[0] http://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf
[1] https://github.com/gluster/glusterfs-specs/blob/master/done/GlusterFS%203.7/BitRot.md
[2] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-scrub
[3] https://pthree.org/2012/12/11/zfs-administration-part-vi-scrub-and-resilver/

Actions #8

Updated by mhrivnak almost 8 years ago

This is a good plan that will go well with other plans for data model improvements in 3.y.

ostree is a strange case. It may not make sense to track all of its files. Otherwise, this should work well for other content types.

Thinking of this as a wishlist, in at least some cases we want to store a gpg signature with a file.

What do you have in mind for access control settings? I don't think we have anything like that on units today.

When you say "It might not be a bad idea to also track the integrity of this metadata by hashing it and storing the hash with it.", what are you trying to guard against? Database corruption?

Actions #9

Updated by jcline@redhat.com almost 8 years ago

ostree is a strange case. It may not make sense to track all of its files. Otherwise, this should work well for other content types.

There are, of course, types of content that provide their own integrity checks. That's fine, and whether we introduce additional safeguards really depends on the situation, but we should ensure there is a way to "get at" that feature of OSTree (or Git, or whatever) to scrub the content unit and repair it.

Thinking of this as a wishlist, in at least some cases we want to store a gpg signature with a file.

It feels like it might be a layer up, abstraction-wise, but I'm not familiar enough with how each content type that supports GPG-signing does it. It's worth investigating further, in any case.

What do you have in mind for access control settings? I don't think we have anything like that on units today.

Just trying to be forward-thinking. Suppose permissions in /var/lib/pulp/content get trashed - it'd be nice to recover from that.

When you say "It might not be a bad idea to also track the integrity of this metadata by hashing it and storing the hash with it.", what are you trying to guard against? Database corruption?

Sure. I don't know much about the integrity checks databases provide (I'd be surprised if they didn't have some), but I also know people write bugs. It'd be good to have an additional check and a good way to recover when something inevitably goes wrong. I don't feel as strongly about this particular feature because it pales in comparison to the other problems we have, but while we're thinking about it I think it's worth looking into.

Actions #10

Updated by mhrivnak almost 8 years ago

  • Sprint/Milestone changed from 19 to 20
Actions #11

Updated by mhrivnak almost 8 years ago

  • Sprint/Milestone changed from 20 to 21
Actions #12

Updated by jortel@redhat.com almost 8 years ago

Under model changes:

the origin of the file

What is an origin and why would we track it?

Actions #13

Updated by bmbouter almost 8 years ago

  • Sprint/Milestone deleted (21)
  • Platform Release changed from 2.9.0 to 3.0.0
  • Groomed changed from No to Yes
Actions #14

Updated by bmbouter almost 8 years ago

  • Sprint Candidate changed from Yes to No
Actions #15

Updated by jcline@redhat.com almost 8 years ago

jortel@redhat.com wrote:

Under model changes:

the origin of the file

What is an origin and why would we track it?

Where the file came from (a list of urls, I guess) so we can make an attempt to automatically retrieve and repair the file.
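
Roughly, a repair could walk that list until one origin yields content matching the recorded checksum. A small sketch, where `fetch` is a hypothetical callable that takes a URL and returns the file's bytes:

```python
import hashlib


def repair_from_origins(origins, expected_sha256, fetch):
    """Return content from the first origin whose SHA-256 matches, else None.

    `origins` is the recorded list of URLs for the file. `fetch` is a
    hypothetical callable(url) -> bytes supplied by the caller.
    """
    for url in origins:
        try:
            data = fetch(url)
        except OSError:
            # Unreachable origin (urllib's URLError is an OSError subclass);
            # move on to the next one.
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    return None
```

Returning None leaves it to the caller to decide what to do when no origin can supply a good copy.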

Actions #16

Updated by bmbouter about 5 years ago

  • Status changed from NEW to CLOSED - WONTFIX
Actions #17

Updated by bmbouter about 5 years ago

Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.

Actions #18

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added
Actions #19

Updated by daviddavis about 5 years ago

  • Platform Release deleted (3.0.0)
