Task #1600: Store content using consistent and deterministic paths - Pulp

Added by jortel@redhat.com almost 9 years ago. Updated over 5 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
% Done: 100%
Platform Release: 2.8.0
Groomed: Yes
Tags: Pulp 2

Description

Store content using consistent and deterministic paths. In 2.8, the platform determines the _storage_path (with some help from plugins). It currently builds the path from the unit's UUID (unit_id):

/var/lib/pulp/content/units/<type_id>/<unit_id>[0:4]/<unit_id>/<file>

To support recovery scenarios, the path needs to be built from a hash of the unit_key instead of the unit_id. This makes the storage path deterministic: the same unit key always maps to the same path.

Suggested:

/var/lib/pulp/content/units/<type_id>/<hash>[0:4]/<hash>/<file>
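
A minimal sketch of how such a path could be computed. The thread only specifies a SHA256 hex digest of the unit key; the canonical JSON serialization and the names unit_key_digest and storage_path are assumptions made for illustration, not Pulp's actual implementation.

```python
import hashlib
import json
import posixpath

CONTENT_ROOT = "/var/lib/pulp/content/units"


def unit_key_digest(unit_key):
    """Return the SHA256 hex digest of a unit key (a dict).

    Assumption: the key is serialized as JSON with sorted keys so that
    the same key always yields the same bytes, whatever the dict order.
    """
    serialized = json.dumps(unit_key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def storage_path(type_id, unit_key, filename, prefix_len=4):
    """Build <root>/<type_id>/<hash>[0:prefix_len]/<hash>/<file>."""
    digest = unit_key_digest(unit_key)
    return posixpath.join(
        CONTENT_ROOT, type_id, digest[:prefix_len], digest, filename
    )
```

Because the path depends only on the type and the unit key, it can be recomputed without consulting the database, which is what the recovery scenario relies on.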

Updated by bmbouter almost 9 years ago

In the suggestion, when you say hash do you mean <hash>?

Also can you remind me again what motivates the hash[0:4] part of the storage path? I figured the layout format would be:

/var/lib/pulp/content/units/<type_id>/<hash>/<file>

Updated by jortel@redhat.com almost 9 years ago

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

Updated by bmbouter almost 9 years ago

Description updated (diff)

jortel@redhat.com wrote:

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

hash is a fine term to use. I was more clarifying if you mean hash as a value which is different for each unit or the string 'hash'. Usually the changing values have <> around them. I just added brackets to the issue description.

I don't think /<hash>[0:4]/<hash>/ is helping us avoid running out of too many files in a directory any more than just using /<hash>/ instead because there is a 1:1 correspondence between /<hash>[0:4]/ and /<hash>/. Am I thinking about this right?

Updated by jortel@redhat.com almost 9 years ago

bmbouter wrote:

jortel@redhat.com wrote:

The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.

hash is a fine term to use. I was more clarifying if you mean hash as a value which is different for each unit or the string 'hash'. Usually the changing values have <> around them. I just added brackets to the issue description.

I don't think /<hash>[0:4]/<hash>/ is helping us avoid running out of too many files in a directory any more than just using /<hash>/ instead because there is a 1:1 correspondence between /<hash>[0:4]/ and /<hash>/. Am I thinking about this right?

The idea comes from how ostree stores files in objects/. I think the expectation is that, given enough values, the first 4 digits of the hash would repeat often enough that multiple units end up sharing a prefix directory. Although, now that you ask, I think the probability of this is worth investigating. This really should be: units/<type_id>/<hash>[0:4]/<hash>[4:]/<file>

For example: hash values (shortened for illustration):

Would produce this tree:
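
As an illustration of the <hash>[0:4]/<hash>[4:] split, here is a minimal sketch that groups digests by prefix. The shortened digests and the group_by_prefix helper are made up for this sketch and are not the example values from this comment.

```python
from collections import defaultdict


def group_by_prefix(digests, prefix_len=4):
    """Map each prefix directory to the remainders stored beneath it."""
    tree = defaultdict(list)
    for digest in digests:
        tree[digest[:prefix_len]].append(digest[prefix_len:])
    return {prefix: sorted(rest) for prefix, rest in tree.items()}


# Hypothetical digests, shortened to 8 characters for readability.
digests = ["0a1b2c3d", "0a1b9f8e", "7c4d1e2f"]
tree = group_by_prefix(digests)

for prefix in sorted(tree):
    print(prefix + "/")
    for rest in tree[prefix]:
        print("    " + rest + "/")
```

Digests that share their first four characters land under the same prefix directory, so the number of entries directly under units/<type_id>/ grows more slowly than the number of units.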

Updated by bmbouter almost 9 years ago

Oh that is interesting. I had not considered that.

I expect the hashing algorithm to provide good randomness in terms of the first 4 characters produced. Assuming the hash algorithm fills the possible combinations evenly and only allowing characters [A-Z] and [0-9], we would expect duplicates after 36^4 units are in the database. That's 1,679,616 units.

Given that, I propose:

/var/lib/pulp/content/units/<type_id>/<hash>/<file>

What do you think given all of this?

Updated by bmbouter almost 9 years ago

Blocks Story #236: Don't re-download rpms if they exist on disk added

Updated by jortel@redhat.com almost 9 years ago

I did a quick test which generated 100,000 hashes using sha256 on unique strings. This produced enough duplicates of hash[0:4] to create 29,000 directories each containing 2-5 items. This seems to support that there is value in this approach.
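
The test script itself is not part of this thread. A comparable experiment could look like the sketch below, which assumes input strings of the form unit-<n> (an arbitrary choice) and uses the hypothetical helpers prefix_histogram and summarize. The "shared" figure counts prefixes used by two or more digests, which is one possible reading of the directory count quoted above.

```python
import hashlib
from collections import Counter


def prefix_histogram(count, prefix_len):
    """Hash `count` unique strings and count digests per prefix."""
    buckets = Counter()
    for i in range(count):
        # Assumption: any set of unique strings will do as input.
        data = ("unit-%d" % i).encode("utf-8")
        digest = hashlib.sha256(data).hexdigest()
        buckets[digest[:prefix_len]] += 1
    return buckets


def summarize(buckets):
    """Report how many prefix directories exist and how full they are."""
    sizes = list(buckets.values())
    return {
        "directories": len(sizes),
        "shared": sum(1 for size in sizes if size > 1),
        "min": min(sizes),
        "max": max(sizes),
    }


if __name__ == "__main__":
    print(summarize(prefix_histogram(100000, 4)))
```

Changing the second argument of prefix_histogram models a different prefix length with the same input set.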

Updated by bmbouter almost 9 years ago

jortel@redhat.com wrote:

I did a quick test which generated 100,000 hashes using sha256 on unique strings. This produced enough duplicates of hash[0:4] to create 29,000 directories each containing 2-5 items. This seems to support that there is value in this approach.

Nice test! +1 to keeping it as it is written in the issue:

/var/lib/pulp/content/units/<type_id>/<hash>[0:4]/<hash>/<file>

Updated by jcline@redhat.com almost 9 years ago

I don't know how many units people keep in Pulp of a given type, and I don't know what filesystems people use. I do know that ext3 has sub-directory limits of 31998, which is less than 36**3 (I'm assuming a 36 character alphabet for our hash language). I also suspect that performance starts to degrade pretty seriously with large numbers of sub-directories, but without knowing the nitty-gritty details of each filesystem, that's a bit hand-wavy.

My personal opinion is that a prefix of 2 is quite sufficient. That's 1296 possible sub-directories (assuming a 36 character alphabet) and if they are evenly distributed we should, on average, maintain sub-directory counts less than or equal to 1296 until well over 1 million units. I don't think an even distribution is necessarily a guarantee of a hashing algorithm, but it's not hugely damaging if we assume it will be near to even.
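
The arithmetic can be checked with a short sketch. Note that a SHA256 hex digest only uses the 16 characters 0-9 and a-f, so the 36 character alphabet assumed here overestimates the number of prefixes; the sketch prints both cases. The prefix_stats helper is hypothetical, and the ext3 limit is the figure quoted in this comment.

```python
EXT3_SUBDIR_LIMIT = 31998  # figure quoted in the comment above


def prefix_stats(alphabet_size, prefix_len, units):
    """Return (possible prefix directories, average units per directory).

    Assumes digests are spread evenly across all prefixes.
    """
    directories = alphabet_size ** prefix_len
    return directories, units / directories


if __name__ == "__main__":
    for alphabet_size in (36, 16):
        for prefix_len in (2, 3, 4):
            directories, average = prefix_stats(
                alphabet_size, prefix_len, 1000000
            )
            fits = directories <= EXT3_SUBDIR_LIMIT
            print(
                "alphabet=%d prefix=%d directories=%d "
                "avg_per_dir=%.1f within_ext3_limit=%s"
                % (alphabet_size, prefix_len, directories, average, fits)
            )
```

With a hexadecimal digest, a two-character prefix yields at most 256 directories and a four-character prefix at most 65,536, which is above the quoted ext3 limit.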

Updated by jortel@redhat.com almost 9 years ago

Using 2 digits seems to produce more favorable results. Running the same test for 100,000 units using hash[0:2] and SHA256 results in creating 256 directories each containing 300-400 subdirectories.

Updated by jortel@redhat.com almost 9 years ago

If only using 2 digits, I'm fine with either.

Updated by mhrivnak almost 9 years ago

+1 to all of this. 2 characters should be enough and is a common approach I've seen in other situations.

One additional motivation: on some filesystems, there can be a real performance impact on file access if a directory listing gets very large. Especially on some older filesystems, the directory data structure is not optimized for random access.
