Task #1600
Closed: Store content using consistent and deterministic paths
Added by jortel@redhat.com almost 9 years ago. Updated over 5 years ago.
100%
Description
Store content using consistent and deterministic paths. In 2.8, the platform determines the _storage_path (with some help from plugins). It determines the path using the UUID as:
/var/lib/pulp/content/units/<type_id>/<unit_id>[0:4]/<unit_id>/<file>
To support recovery scenarios, the path will need to be derived from the hash of the unit_key instead of the unit_id. This makes the storage path deterministic.
Suggested:
/var/lib/pulp/content/units/<type_id>/<hash>[0:4]/<hash>/<file>
Updated by bmbouter almost 9 years ago
In the suggestion, when you say hash do you mean <hash>?
Also can you remind me again what motivates the hash[0:4] part of the storage path? I figured the layout format would be:
/var/lib/pulp/content/units/<type_id>/<hash>/<file>
Updated by jortel@redhat.com almost 9 years ago
The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.
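A minimal sketch of how such a path could be derived. The function name `storage_path`, the `prefix_len` parameter, and serializing the unit key as sorted JSON before hashing are all illustrative assumptions, not Pulp's actual implementation; only the root directory and the path layout come from the issue description.

```python
import hashlib
import json
import posixpath

CONTENT_ROOT = "/var/lib/pulp/content/units"  # root taken from the issue description


def storage_path(type_id, unit_key, filename, prefix_len=4):
    """Build <root>/<type_id>/<hash>[0:prefix_len]/<hash>/<file>.

    Hypothetical helper: sorted-JSON serialization is an assumption made so
    that equal unit keys always produce the same SHA256 hex digest.
    """
    serialized = json.dumps(unit_key, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return posixpath.join(CONTENT_ROOT, type_id, digest[:prefix_len], digest, filename)


# The same unit key always maps to the same path, regardless of key order.
first = storage_path("rpm", {"name": "zsh", "version": "5.0"}, "zsh.rpm")
second = storage_path("rpm", {"version": "5.0", "name": "zsh"}, "zsh.rpm")
print(first == second)
```

Because the path depends only on the type and the unit key, it can be recomputed after the database is lost, which is the recovery property the description asks for.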
Updated by bmbouter almost 9 years ago
- Description updated (diff)
jortel@redhat.com wrote:
The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.
hash is a fine term to use. I was more clarifying if you mean hash as a value which is different for each unit or the string 'hash'. Usually the changing values have <> around them. I just added brackets to the issue description.
I don't think /<hash>[0:4]/<hash>/ is helping us avoid running out of too many files in a directory any more than just using /<hash>/ instead because there is a 1:1 correspondence between /<hash>[0:4]/ and /<hash>/. Am I thinking about this right?
Updated by jortel@redhat.com almost 9 years ago
bmbouter wrote:
jortel@redhat.com wrote:
The term hash in the path refers to the SHA256 hex digest of the unit key. Perhaps digest would be more accurate. The motivation for hash[0:4]/hash was to reduce the possibility of exceeding the maximum number of sub-directories in the units/.
hash is a fine term to use. I was more clarifying if you mean hash as a value which is different for each unit or the string 'hash'. Usually the changing values have <> around them. I just added brackets to the issue description.
I don't think /<hash>[0:4]/<hash>/ is helping us avoid running out of too many files in a directory any more than just using /<hash>/ instead because there is a 1:1 correspondence between /<hash>[0:4]/ and /<hash>/. Am I thinking about this right?
The idea comes from how ostree stores files in objects/. I think the expectation is that, given enough values, the first 4 digits of the hash would repeat often enough to group multiple units under each prefix directory. Although, now that you ask, I think the probability of this is worth investigating. This really should be: units/<type_id>/<hash>[0:4]/<hash>[4:]/<file>
For example, these hash values (shortened for illustration):
123401231
1234A1232
1234B1233
123501234
1235A1235
1235B1236
475649487
994049858
Would produce this tree:
1234/
    01231/
    A1232/
    B1233/
1235/
    01234/
    A1235/
    B1236/
4756/
    49487/
9940/
    49858/
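The grouping can be sketched in a few lines. This is only an illustration of the <hash>[0:4]/<hash>[4:] split applied to the shortened example values; the function name `build_tree` is hypothetical.

```python
from collections import defaultdict


def build_tree(hashes, prefix_len=4):
    """Group each hash under <hash>[0:prefix_len]/<hash>[prefix_len:]."""
    tree = defaultdict(list)
    for value in hashes:
        tree[value[:prefix_len]].append(value[prefix_len:])
    return dict(tree)


hashes = ["123401231", "1234A1232", "1234B1233", "123501234",
          "1235A1235", "1235B1236", "475649487", "994049858"]

# Print one top-level directory per prefix, with its children indented.
for prefix, children in sorted(build_tree(hashes).items()):
    print(prefix + "/")
    for child in children:
        print("    " + child + "/")
```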
Updated by bmbouter almost 9 years ago
Oh that is interesting. I had not considered that.
I expect the hashing algorithm to provide good randomness in terms of the first 4 characters produced. Assuming the hash algorithm fills the possible combinations evenly and only allowing characters [A-Z] and [0-9], we would expect duplicates after 36^4 units are in the database. That's 1,679,616 units.
Given that, I propose:
/var/lib/pulp/content/units/<type_id>/<hash>/<file>
What do you think given all of this?
Updated by bmbouter almost 9 years ago
- Blocks Story #236: Don't re-download rpms if they exist on disk added
Updated by jortel@redhat.com almost 9 years ago
I did a quick test which generated 100,000 hashes using sha256 on unique strings. This produced enough duplicates of hash[0:4] to create 29,000 directories each containing 2-5 items. This seems to support that there is value in this approach.
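A test along these lines could look like the sketch below. The original script is not shown in this thread, so the input strings (`unit-0`, `unit-1`, ...) and the function name `prefix_histogram` are assumptions, and the exact counts will depend on the inputs chosen.

```python
import hashlib
from collections import Counter


def prefix_histogram(count, prefix_len):
    """Count how many SHA256 hex digests share each leading prefix."""
    buckets = Counter()
    for i in range(count):
        # Hypothetical unique input strings; any unique strings would do.
        digest = hashlib.sha256(("unit-%d" % i).encode("utf-8")).hexdigest()
        buckets[digest[:prefix_len]] += 1
    return buckets


buckets = prefix_histogram(100000, 4)
shared = sum(1 for n in buckets.values() if n >= 2)
print("prefix directories created:", len(buckets))
print("directories holding 2 or more units:", shared)
```

Passing a different `prefix_len` repeats the experiment for other prefix widths.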
Updated by bmbouter almost 9 years ago
jortel@redhat.com wrote:
I did a quick test which generated 100,000 hashes using sha256 on unique strings. This produced enough duplicates of hash[0:4] to create 29,000 directories each containing 2-5 items. This seems to support that there is value in this approach.
Nice test! +1 to keeping it as it is written in the issue:
/var/lib/pulp/content/units/<type_id>/<hash>[0:4]/<hash>/<file>
Updated by jcline@redhat.com almost 9 years ago
I don't know how many units people keep in Pulp of a given type, and I don't know what filesystems people use. I do know that ext3 has sub-directory limits of 31998, which is less than 36**3 (I'm assuming a 36 character alphabet for our hash language). I also suspect that performance starts to degrade pretty seriously with large numbers of sub-directories, but without knowing the nitty-gritty details of each filesystem, that's a bit hand-wavy.
My personal opinion is that a prefix of 2 is quite sufficient. That's 1296 possible sub-directories (assuming a 36 character alphabet) and if they are evenly distributed we should, on average, maintain sub-directory counts less than or equal to 1296 until well over 1 million units. I don't think an even distribution is a necessary guarantee of a hashing algorithm, but it's not hugely damaging if we assume it will be near to even.
Updated by jortel@redhat.com almost 9 years ago
Using 2 digits seems to produce more favorable results. Running the same test for 100,000 units using hash[0:2] and SHA256 results in creating 256 directories each containing 300-400 subdirectories.
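The figure of 256 directories is what a hexadecimal digest allows: a SHA256 hex digest uses only the 16 symbols 0-9 and a-f, so a two-character prefix has 16^2 = 256 possible values. A small sketch of that arithmetic, assuming an even distribution across prefixes (the function name `expected_load` is illustrative):

```python
def expected_load(units, alphabet_size, prefix_len):
    """Return (possible prefix directories, mean units per directory)."""
    directories = alphabet_size ** prefix_len
    return directories, units / float(directories)


# Compare the two prefix widths discussed in this thread for a hex alphabet.
for prefix_len in (2, 4):
    dirs, mean = expected_load(100000, 16, prefix_len)
    print("hex prefix of %d: %d directories, about %.1f units each"
          % (prefix_len, dirs, mean))
```

For 100,000 units, a two-character prefix averages roughly 390 units per directory, while a four-character prefix spreads them over up to 65,536 directories at fewer than 2 units each.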
Updated by jortel@redhat.com almost 9 years ago
If only using 2 digits, I'm fine with either.
Updated by mhrivnak almost 9 years ago
+1 to all of this. 2 characters should be enough and is a common approach I've seen in other situations.
One additional motivation: on some filesystems, there can be a real performance impact on file access if a directory listing gets very large. Especially on some older filesystems, the directory data structure is not optimized for random access.
Updated by jortel@redhat.com almost 9 years ago
- Status changed from NEW to ASSIGNED
Updated by jortel@redhat.com almost 9 years ago
- Status changed from ASSIGNED to POST
Added by jortel@redhat.com almost 9 years ago
Revision ec3fafac
Unit storage path based on sha256 of unit key instead of using UUID. closes #1600
Updated by jortel@redhat.com almost 9 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|ec3fafac36510e6d9216afbe22a9645f5f2c6e8d.
Updated by dkliban@redhat.com almost 9 years ago
- Status changed from MODIFIED to 5
Updated by dkliban@redhat.com almost 9 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE