Task #1600

closed

Store content using consistent and deterministic paths

Added by jortel@redhat.com almost 9 years ago. Updated over 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:
100%

Estimated time:
Platform Release:
2.8.0
Groomed:
Yes
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

Store content using consistent and deterministic paths. In 2.8, the platform determines the _storage_path (with some help from plugins). It determines the path using the UUID as:

/var/lib/pulp/content/units/<type_id>/<unit_id>[0:4]/<unit_id>/<file>

To support recovery scenarios, using the hash of unit_key instead of unit_id will be required. This makes the storage path deterministic.

Suggested:

/var/lib/pulp/content/units/<type_id>/<hash>[0:4]/<hash>/<file>
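
As a rough illustration of the suggested layout, the path could be derived as in the sketch below. This is not the Pulp implementation: the function names, the `CONTENT_ROOT` constant, and the JSON serialization of the unit key (sorted keys) are assumptions made only so the example is deterministic and runnable.

```python
import hashlib
import json
import posixpath

# Root taken from the path shown in the description.
CONTENT_ROOT = "/var/lib/pulp/content/units"


def unit_key_digest(unit_key):
    """Return the SHA256 hex digest of a unit key (a dict).

    Assumption: the key is serialized as JSON with sorted keys so that the
    same key always produces the same bytes, regardless of dict ordering.
    """
    canonical = json.dumps(unit_key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def storage_path(type_id, unit_key, filename, prefix_len=4):
    """Build <root>/<type_id>/<hash>[0:prefix_len]/<hash>/<file>."""
    digest = unit_key_digest(unit_key)
    return posixpath.join(
        CONTENT_ROOT, type_id, digest[:prefix_len], digest, filename
    )
```

Because the path depends only on the type and the unit key, it can be recomputed after the database is lost, which is the recovery property the description asks for.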

Related issues

Blocks RPM Support - Story #236: Don't re-download rpms if they exist on disk (CLOSED - CURRENTRELEASE, mhrivnak)

Actions #1

Updated by bmbouter almost 9 years ago

In the suggestion, when you say hash do you mean <hash>?

Also can you remind me again what motivates the hash[0:4] part of the storage path? I figured the layout format would be:

/var/lib/pulp/content/units/<type_id>/<hash>/<file>
Actions #2

Updated by jortel@redhat.com almost 9 years ago

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

Actions #3

Updated by bmbouter almost 9 years ago

  • Description updated (diff)

jortel@redhat.com wrote:

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

hash is a fine term to use. I was more clarifying if you mean hash as a value which is different for each unit or the string 'hash'. Usually the changing values have <> around them. I just added brackets to the issue description.

I don't think /<hash>[0:4]/<hash>/ helps us avoid having too many entries in a directory any more than just using /<hash>/, because there is a 1:1 correspondence between /<hash>[0:4]/ and /<hash>/. Am I thinking about this right?

Actions #4

Updated by jortel@redhat.com almost 9 years ago

bmbouter wrote:

jortel@redhat.com wrote:

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

hash is a fine term to use. I was more clarifying if you mean hash as a value which is different for each unit or the string 'hash'. Usually the changing values have <> around them. I just added brackets to the issue description.

I don't think /<hash>[0:4]/<hash>/ is helping us avoid running out of too many files in a directory any more than just using /<hash>/ instead because there is a 1:1 correspondence between /<hash>[0:4]/ and /<hash>/. Am I thinking about this right?

The idea comes from how ostree stores files in objects/. I think the expectation is that, given enough values, the first 4 digits of the hash would repeat often enough for units to share prefix directories. Although, now that you ask, I think the probability of this is worth investigating. This really should be: units/<type_id>/<hash>[0:4]/<hash>[4:]/<file>

For example: hash values (shortened for illustration):

123401231
1234A1232
1234B1233
123501234
1235A1235
1235B1236
475649487
994049858

Would produce this tree:

1234/
   01231/
   A1232/
   B1233/
1235/
   01234/
   A1235/
   B1236/
4756/
   49487/
9940/
   49858/
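
The grouping in the example above can be reproduced with a short sketch. The `build_tree` helper is hypothetical and only splits each value into `<hash>[0:4]` and `<hash>[4:]`, as proposed in this comment.

```python
from collections import defaultdict


def build_tree(digests, prefix_len=4):
    """Group digests into {prefix: [remainder, ...]}, preserving input order."""
    tree = defaultdict(list)
    for digest in digests:
        tree[digest[:prefix_len]].append(digest[prefix_len:])
    return dict(tree)


# Shortened hash values from the example in this comment.
digests = [
    "123401231", "1234A1232", "1234B1233",
    "123501234", "1235A1235", "1235B1236",
    "475649487", "994049858",
]

tree = build_tree(digests)
for prefix in sorted(tree):
    print(prefix + "/")
    for remainder in tree[prefix]:
        print("   " + remainder + "/")
```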
Actions #5

Updated by bmbouter almost 9 years ago

Oh that is interesting. I had not considered that.

I expect the hashing algorithm to provide good randomness in terms of the first 4 characters produced. Assuming the hash algorithm fills the possible combinations evenly and only allowing characters [A-Z] and [0-9], we would expect duplicates after 36^4 units are in the database. That's 1,679,616 units.

Given that, I propose:

/var/lib/pulp/content/units/<type_id>/<hash>/<file>

What do you think given all of this?

Actions #6

Updated by bmbouter almost 9 years ago

  • Blocks Story #236: Don't re-download rpms if they exist on disk added
Actions #7

Updated by jortel@redhat.com almost 9 years ago

I did a quick test which generated 100,000 hashes using sha256 on unique strings. This produced enough duplicates of hash[0:4] to create 29,000 directories each containing 2-5 items. This seems to support that there is value in this approach.
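
The test script itself is not attached to the issue. A minimal sketch of a comparable experiment is shown below; the input strings (`unit-0`, `unit-1`, ...) and the helper names are assumptions, so the exact counts it prints may differ from the figures quoted above.

```python
import hashlib
from collections import Counter


def prefix_histogram(n_units, prefix_len):
    """Count how many SHA256 hex digests fall under each digest prefix."""
    counts = Counter()
    for i in range(n_units):
        # Assumption: any set of unique strings will do as stand-in unit keys.
        digest = hashlib.sha256(("unit-%d" % i).encode("utf-8")).hexdigest()
        counts[digest[:prefix_len]] += 1
    return counts


def summarize(counts):
    """Report the number of prefix directories and how many hold 2+ entries."""
    sizes = list(counts.values())
    return {
        "directories": len(sizes),
        "shared_directories": sum(1 for size in sizes if size > 1),
        "largest": max(sizes),
    }


if __name__ == "__main__":
    print(summarize(prefix_histogram(100000, prefix_len=4)))
```

Changing `prefix_len` makes it easy to compare other prefix lengths with the same inputs.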

Actions #8

Updated by bmbouter almost 9 years ago

jortel@redhat.com wrote:

I did a quick test which generated 100,000 hashes using sha256 on unique strings. This produced enough duplicates of hash[0:4] to create 29,000 directories each containing 2-5 items. This seems to support that there is value in this approach.

Nice test! +1 to keeping it as it is written in the issue:

/var/lib/pulp/content/units/<type_id>/<hash>[0:4]/<hash>/<file>
Actions #9

Updated by jcline@redhat.com almost 9 years ago

I don't know how many units people keep in Pulp of a given type, and I don't know what filesystems people use. I do know that ext3 has sub-directory limits of 31998, which is less than 36**3 (I'm assuming a 36 character alphabet for our hash language). I also suspect that performance starts to degrade pretty seriously with large numbers of sub-directories, but without knowing the nitty-gritty details of each filesystem, that's a bit hand-wavy.

My personal opinion is that a prefix of 2 is quite sufficient. That's 1296 possible sub-directories (assuming a 36 character alphabet) and if they are evenly distributed we should, on average, maintain sub-directory counts less than or equal to 1296 until well over 1 million units. I don't think an even distribution is a necessary guarantee of a hashing algorithm, but it's not hugely damaging if we assume it will be near to even.

Actions #10

Updated by jortel@redhat.com almost 9 years ago

Using 2 digits seems to produce more favorable results. Running the same test for 100,000 units using hash[0:2] and SHA256 results in creating 256 directories each containing 300-400 subdirectories.
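
The numbers in the last few comments can be checked with simple arithmetic. Note that a SHA256 hex digest uses a 16-character alphabet (0-9, a-f) rather than 36, which matches the 256 directories reported here. The helper names below are hypothetical, and the ext3 limit is the figure quoted earlier in this thread. The sketch assumes an even distribution across prefixes.

```python
# Sub-directory limit for ext3 as quoted in the thread.
EXT3_MAX_SUBDIRS = 31998


def prefix_count(alphabet_size, prefix_len):
    """Number of distinct prefixes of the given length."""
    return alphabet_size ** prefix_len


def average_per_prefix(n_units, alphabet_size, prefix_len):
    """Average number of unit directories under each prefix directory."""
    return n_units / prefix_count(alphabet_size, prefix_len)


def units_at_limit(alphabet_size, prefix_len, limit=EXT3_MAX_SUBDIRS):
    """Units of one type that fit before the average prefix hits the limit."""
    return prefix_count(alphabet_size, prefix_len) * limit


if __name__ == "__main__":
    print(prefix_count(36, 4))                  # 36-character alphabet, 4 digits
    print(prefix_count(16, 2))                  # hex alphabet, 2 digits
    print(average_per_prefix(100000, 16, 2))    # average for the 100,000-unit test
    print(units_at_limit(16, 2))                # capacity under the ext3 limit
```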

Actions #12

Updated by jortel@redhat.com almost 9 years ago

If only using 2 digits, I'm fine with either.

Actions #13

Updated by mhrivnak almost 9 years ago

+1 to all of this. 2 characters should be enough and is a common approach I've seen in other situations.

One additional motivation: on some filesystems, there can be a real performance impact on file access if a directory listing gets very large. Especially on some older filesystems, the directory data structure is not optimized for random access.

Actions #14

Updated by jortel@redhat.com almost 9 years ago

  • Status changed from NEW to ASSIGNED
Actions #15

Updated by jortel@redhat.com almost 9 years ago

  • Status changed from ASSIGNED to POST

Added by jortel@redhat.com almost 9 years ago

Revision ec3fafac | View on GitHub

Unit storage path based on sha256 of unit key instead of using UUID. closes #1600

Actions #16

Updated by jortel@redhat.com almost 9 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #17

Updated by dkliban@redhat.com almost 9 years ago

  • Status changed from MODIFIED to 5
Actions #18

Updated by dkliban@redhat.com over 8 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #19

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added
