Task #1600

Store content using consistent and deterministic paths

Added by jortel@redhat.com over 4 years ago. Updated over 1 year ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 100%
Estimated time:
Platform Release: 2.8.0
Groomed: Yes
Sprint Candidate: No
Tags: Pulp 2
Sprint:

Description

Store content using consistent and deterministic paths. In 2.8, the platform determines the _storage_path (with some help from plugins). It determines the path using the UUID as:

/var/lib/pulp/content/units/<type_id>/<unit_id>[0:4]/<unit_id>/<file>

To support recovery scenarios, the path must be derived from the hash of the unit_key instead of the unit_id. This makes the storage path deterministic.

Suggested:

/var/lib/pulp/content/units/<type_id>/<hash>[0:4]/<hash>/<file>
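A minimal sketch of how such a path could be derived, not the actual Pulp implementation. The function names, the `prefix_len` parameter, and the JSON serialization of the unit key are all assumptions made for illustration; the issue only states that the hash of the unit key is used.

```python
import hashlib
import json
import posixpath

# Root directory taken from the paths shown in this issue.
CONTENT_ROOT = "/var/lib/pulp/content/units"


def unit_key_digest(unit_key):
    # Assumption: the unit key is serialized as JSON with sorted keys so that
    # the same key always produces the same bytes. The issue does not say how
    # the key is serialized before hashing.
    canonical = json.dumps(unit_key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def storage_path(type_id, unit_key, filename, prefix_len=4):
    # Layout: <root>/<type_id>/<hash>[0:prefix_len]/<hash>/<file>
    digest = unit_key_digest(unit_key)
    return posixpath.join(CONTENT_ROOT, type_id, digest[:prefix_len], digest, filename)
```

Because the path depends only on the type and the unit key, recomputing it for the same unit (for example during recovery) gives the same location, which a random UUID cannot guarantee.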

Related issues

Blocks RPM Support - Story #236: Don't re-download rpms if they exist on disk (CLOSED - CURRENTRELEASE)

Associated revisions

Revision ec3fafac View on GitHub
Added by jortel@redhat.com over 4 years ago

Unit storage path based on sha256 of unit key instead of using UUID. closes #1600

History

#1 Updated by bmbouter over 4 years ago

In the suggestion, when you say hash do you mean <hash>?

Also can you remind me again what motivates the hash[0:4] part of the storage path? I figured the layout format would be:

/var/lib/pulp/content/units/<type_id>/<hash>/<file>

#2 Updated by jortel@redhat.com over 4 years ago

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

#3 Updated by bmbouter over 4 years ago

  • Description updated (diff)

jortel@redhat.com wrote:

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

hash is a fine term to use. I was more clarifying if you mean hash as a value which is different for each unit or the string 'hash'. Usually the changing values have <> around them. I just added brackets to the issue description.

I don't think /<hash>[0:4]/<hash>/ is helping us avoid running out of too many files in a directory any more than just using /<hash>/ instead because there is a 1:1 correspondence between /<hash>[0:4]/ and /<hash>/. Am I thinking about this right?

#4 Updated by jortel@redhat.com over 4 years ago

bmbouter wrote:

jortel@redhat.com wrote:

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

hash is a fine term to use. I was more clarifying if you mean hash as a value which is different for each unit or the string 'hash'. Usually the changing values have <> around them. I just added brackets to the issue description.

I don't think /<hash>[0:4]/<hash>/ is helping us avoid running out of too many files in a directory any more than just using /<hash>/ instead because there is a 1:1 correspondence between /<hash>[0:4]/ and /<hash>/. Am I thinking about this right?

The idea comes from how ostree stores files in objects/. I think the expectation is that, given enough values, the first 4 digits of the hash would repeat often enough to group units into shared directories. Although, now that you ask, I think the probability of this is worth investigating. This really should be: units/<type_id>/<hash>[0:4]/<hash>[4:]/<file>

For example: hash values (shortened for illustration):

123401231
1234A1232
1234B1233
123501234
1235A1235
1235B1236
475649487
994049858

Would produce this tree:

1234/
   01231/
   A1232/
   B1233/
1235/
   01234/
   A1235/
   B1236/
4756/
   49487/
9940/
   49858/
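The grouping can be reproduced with a short sketch that splits each sample hash into `[0:4]` and `[4:]`. The `build_tree` helper is hypothetical and only illustrates the layout.

```python
from collections import defaultdict

# Shortened sample hashes from the comment above.
hashes = [
    "123401231", "1234A1232", "1234B1233",
    "123501234", "1235A1235", "1235B1236",
    "475649487", "994049858",
]


def build_tree(digests, prefix_len=4):
    # Map each prefix directory to the list of remainder directories under it.
    tree = defaultdict(list)
    for digest in digests:
        tree[digest[:prefix_len]].append(digest[prefix_len:])
    return dict(tree)


tree = build_tree(hashes)
for prefix in sorted(tree):
    print(prefix + "/")
    for rest in sorted(tree[prefix]):
        print("   " + rest + "/")
```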

#5 Updated by bmbouter over 4 years ago

Oh that is interesting. I had not considered that.

I expect the hashing algorithm to provide good randomness in terms of the first 4 characters produced. Assuming the hash algorithm fills the possible combinations evenly and only allowing characters [A-Z] and [0-9], we would expect duplicates after 36^4 units are in the database. That's 1,679,616 units.

Given that, I propose:

/var/lib/pulp/content/units/<type_id>/<hash>/<file>

What do you think given all of this?

#6 Updated by bmbouter over 4 years ago

  • Blocks Story #236: Don't re-download rpms if they exist on disk added

#7 Updated by jortel@redhat.com over 4 years ago

I did a quick test which generated 100,000 hashes using sha256 on unique strings. This produced enough duplicates of hash[0:4] to create 29,000 directories each containing 2-5 items. This seems to support that there is value in this approach.
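The original test script is not included in this issue. A sketch of a comparable experiment might look like the following; the `unit-<n>` strings are hypothetical stand-ins for unit keys, so the exact counts will differ from the figures quoted above.

```python
import hashlib
from collections import Counter


def prefix_counts(n_units, prefix_len):
    # Count how many digests fall under each hash[0:prefix_len] directory.
    counts = Counter()
    for i in range(n_units):
        # Hypothetical unique strings stand in for real unit keys.
        digest = hashlib.sha256(("unit-%d" % i).encode("utf-8")).hexdigest()
        counts[digest[:prefix_len]] += 1
    return counts


counts = prefix_counts(100000, 4)
shared = [c for c in counts.values() if c >= 2]
print("prefix directories:", len(counts))
print("directories holding 2 or more units:", len(shared))
print("largest directory:", max(counts.values()))
```

Passing a different `prefix_len` makes it easy to compare other prefix widths on the same input.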

#8 Updated by bmbouter over 4 years ago

jortel@redhat.com wrote:

I did a quick test which generated 100,000 hashes using sha256 on unique strings. This produced enough duplicates of hash[0:4] to create 29,000 directories each containing 2-5 items. This seems to support that there is value in this approach.

Nice test! +1 to keeping it as it is written in the issue:

/var/lib/pulp/content/units/<type_id>/<hash>[0:4]/<hash>/<file>

#9 Updated by jcline@redhat.com over 4 years ago

I don't know how many units people keep in Pulp of a given type, and I don't know what filesystems people use. I do know that ext3 has sub-directory limits of 31998, which is less than 36**3 (I'm assuming a 36 character alphabet for our hash language). I also suspect that performance starts to degrade pretty seriously with large numbers of sub-directories, but without knowing the nitty-gritty details of each filesystem, that's a bit hand-wavy.

My personal opinion is that a prefix of 2 is quite sufficient. That's 1296 possible sub-directories (assuming a 36 character alphabet) and if they are evenly distributed we should, on average, maintain sub-directory counts less than or equal to 1296 until well over 1 million units. I don't think an even distribution is a necessary guarantee of a hashing algorithm, but it's not hugely damaging if we assume it will be near to even.

#10 Updated by jortel@redhat.com over 4 years ago

Using 2 digits seems to produce more favorable results. Running the same test for 100,000 units using hash[0:2] and SHA256 results in creating 256 directories each containing 300-400 subdirectories.
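The 256 figure is consistent with a SHA256 hex digest using 16 symbols (0-9, a-f) rather than the 36 assumed in earlier comments, since 16^2 = 256. A rough arithmetic sketch of the trade-off, assuming an even distribution; the helper names are hypothetical and the ext3 limit is the figure quoted in comment #9:

```python
EXT3_SUBDIR_LIMIT = 31998  # sub-directory limit quoted in comment #9


def prefix_dirs(alphabet_size, prefix_len):
    # Number of distinct prefix directories that can exist.
    return alphabet_size ** prefix_len


def mean_subdirs(n_units, alphabet_size, prefix_len):
    # Average number of unit directories per prefix, assuming an even spread.
    return n_units / float(prefix_dirs(alphabet_size, prefix_len))


for prefix_len in (2, 4):
    dirs = prefix_dirs(16, prefix_len)
    print(
        "prefix length %d: %d prefix dirs, %.1f units per dir at 100,000 units, "
        "about %d units before the average dir reaches the ext3 limit"
        % (prefix_len, dirs, mean_subdirs(100000, 16, prefix_len),
           dirs * EXT3_SUBDIR_LIMIT)
    )
```

With a 2-character hex prefix the first level can never exceed 256 entries, and the average second-level directory holds about 390 units at 100,000 units, in line with the 300-400 range reported above.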

#12 Updated by jortel@redhat.com over 4 years ago

If only using 2 digits, I'm fine with either.

#13 Updated by mhrivnak over 4 years ago

+1 to all of this. 2 characters should be enough and is a common approach I've seen in other situations.

One additional motivation: on some filesystems, there can be a real performance impact on file access if a directory listing gets very large. Especially on some older filesystems, the directory data structure is not optimized for random access.

#14 Updated by jortel@redhat.com over 4 years ago

  • Status changed from NEW to ASSIGNED

#15 Updated by jortel@redhat.com over 4 years ago

  • Status changed from ASSIGNED to POST

#16 Updated by jortel@redhat.com over 4 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100

#17 Updated by dkliban@redhat.com over 4 years ago

  • Status changed from MODIFIED to 5

#18 Updated by dkliban@redhat.com over 4 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE

#19 Updated by bmbouter over 1 year ago

  • Tags Pulp 2 added
