Project

Profile

Help

Task #2987

The Distribution ViewSet needs to prevent base_path overlap.

Added by jortel@redhat.com 3 months ago. Updated about 2 months ago.

Status:
NEW
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
% Done:

0%

Platform Release:
Blocks Release:
Backwards Incompatible:
No
Groomed:
No
Sprint Candidate:
Yes
Tags:
Pulp 3
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No

Description

The Distribution ViewSet needs to previent base_path overlap. The model (DB) ensures base_path uniqueness but does not prevent one base_path from being nested within another. The look-and-see logic in the pulp2 check for this is subject to race conditions. This needs to be solved using the DB.

For example:

base_path = a/b/c/d needs to reserve the entire directory tree. Another base_path cannot begin with a/b/c/d.


Related issues

Related to Pulp - Task #3051: Prevent Distribution base_path overlap in the data model NEW Actions

History

#1 Updated by jortel@redhat.com 3 months ago

  • Description updated (diff)

#2 Updated by jortel@redhat.com 3 months ago

  • Subject changed from The Distribution serializer needs to previent base_path overlap. to The Distribution ViewSet needs to previent base_path overlap.
  • Description updated (diff)

#3 Updated by amacdona@redhat.com 2 months ago

  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes

#4 Updated by mhrivnak 2 months ago

  • Sprint/Milestone set to Sprint 26

#5 Updated by jortel@redhat.com about 2 months ago

  • Groomed changed from Yes to No

Ungrooming and removing checklist items per discussion in retrospective.

#6 Updated by jortel@redhat.com about 2 months ago

  • Sprint/Milestone deleted (Sprint 26)

Let's document and discuss potential solutions here.

#7 Updated by mhrivnak about 2 months ago

Thinking of this as two problems, I have a pattern to solve one of them. Maybe it'll inspire someone to come up with another idea that solves both. The problems are:

1. make sure the new base_path is not contained by some existing base_path
2. make sure the new base_path does not contain some existing base_path

Problem 1 is very similar to a problem faced by the app that serves content. Given a full URL, it needs to determine which distribution's base_path is somewhere in that URL.

Given a candidate new base_path or a full path to content, for example "a/b/c/d", you could split that into each of these paths:

  • a
  • a/b
  • a/b/c
  • a/b/c/d

And you could do a query for any Distribution where the base_path is in that list of potential paths.

For use case 1, if a Distribution is found, then the new base_path is not allowed. For the content serving use case, if a Distribution is found, then that is the one which should be further queried to find the requested file.

But as you can see, this does nothing for use case 2.

#8 Updated by jortel@redhat.com about 2 months ago

An additional consideration is race conditions which can be solved in a few ways.

1. The check is enforced with constraints so the DB will prevent race conditions. Not sure how this can be done yet.
2. The check is enforced with queries (as done in pulp2 and suggested in #note-7) in which case a common approach is to lock the table. Best I can tell, explicit table locking is not directly supported by django in a DB agnostic way. However this can be solved easily if we're willing to do a little bit of postgres specific SQL.

As for case 2, there won't be many Distributions and I think we could fetch them all (just base_url) into memory and apply the case 1 algorithm in reverse.

#9 Updated by bmbouter about 2 months ago

Can some example cases that we want to prevent be written out? Having some common examples would be helpful I think in evaluating algorithms.

#10 Updated by dkliban@redhat.com about 2 months ago

@mhrivnak, I think you meant that searching for sub-paths of a path would help with use case 2. For use case number 1, a simple LIKE with a left-anchored pattern would be sufficient. An index is used for such a lookup in both PostgreSQL and MariaDB. Continuing with your example, the query would look like this:

SELECT * FROM distributions
WHERE base_path LIKE 'a/b/c/d%'  

#11 Updated by amacdona@redhat.com about 2 months ago

@bmbouter, I'm not sure if you meant realistic use cases, here is the abstract, general form.

d1.base_path = a/b/c/d/

When d2 is created, the folowing base_paths should violate this constraint:
  • a, a/b/, a/b/c/, and a/b/c/d/ because d2 could pollute d1.base_path. (mhrivnak's case 1)
  • `a/b/c/d/*` because d1` could pollute d2.base_path (mhrivnak's case 2)

Note: d2.base_path = a/b/not_c/d/ is still legal, even though both base_paths have reserved a/ and a/b/

Brainstorming:

  • When a base_path is saved, each path that could pollute this base_path can thought of as reserved.
  • A base_path cannot contain the strings that match ^any_other_base_path.+$
  • A base_path cannot be in all_reserved_paths
  • A new valid base path can contain ^any_reserved_path.+$

Since we can't treat reserved_paths the same as base_paths, I think we should consider a separate table for reserved_paths. With 2 tables, (and BasePath inherits from RelativePath) I'm hoping/speculating that we could validate with a query like this:

def validate_base_path:
    reserved_collisions = models.RelativePaths.objects(path=base_path)
    nested_collisions = models.BasePaths.objects(path__startswith=base_path)
    assert not reserved_collisions.exists() and not nested_collisions.exists()

#12 Updated by mhrivnak about 2 months ago

This covers the most prominent cases I can think of. It includes 1 and 2 from https://pulp.plan.io/issues/2987#note-7, plus it includes a potential string matching gotcha.

Candidate Existing Is OK
a/b/c a/b/d Y
a/b/c a/b/charlie Y
a/b/c a/b N
a/b/c a/b/c N
a/b/c a/b/c/d N

#13 Updated by mhrivnak about 2 months ago

To pile on one more way of thinking about it, since I'm weary of doing string comparison or parsing while honoring path semantics (though I wouldn't say it's out of the question), I think of this in terms of path segments organized in a tree.

Each node is a path segment.
Each non-root node has one parent and 0 to many children.
A Distribution can only exist at a leaf node.
Every leaf node contains a Distribution.

#14 Updated by amacdona@redhat.com about 2 months ago

mhrivnak wrote:

This covers the most prominent cases I can think of. It includes 1 and 2 from https://pulp.plan.io/issues/2987#note-7, plus it includes a potential string matching gotcha.

Candidate Existing Is OK
a/b/c a/b/d Y
a/b/c a/b/charlie Y
a/b/c a/b N
a/b/c a/b/c N
a/b/c a/b/c/d N

We can avoid the string match gotcha if each path segment always includes its trailing slash.

#15 Updated by dkliban@redhat.com about 2 months ago

@mhrivnak, why are you opposed to the string matching? PostgreSQL can use indexes0 to do this.

[0] https://www.postgresql.org/docs/9.5/static/indexes-opclass.html

#16 Updated by mhrivnak about 2 months ago

I'm not opposed. String parsing the whole path might be our best option. It's just not often the best starting point for a tree traversal problem, and there are some potential gotchas we need to be careful with. So I gravitate toward representing the hierarchy more concretely, and as jortel was describing, it would be ideal to enforce data validation within the data model itself. But I'm very open to any solution that gets the job done.

If we normalize the paths (I think we have to regardless of how we store the path data), and if we ensure the paths always end in a trailing slash (probably safe but we need to enforce it in code), and if the database can do such searches quickly, then I think it's a viable option.

Normalization might be a little tricky, although we have to do it regardless. As long as we normalize paths before saving them to the DB and normalize the "candidate" path being used in each query, we should be ok. Here are some normalization details we have to keep in mind:

https://tools.ietf.org/html/rfc3986#page-40

Surprisingly, it looks like python does not have a standard library function to do this normalization. :( The big items are case normalization and collapsing any dot segments or multiple slashes. Maybe django has something built in that can do this for us.

So given what we have so far, I like the simplicity of the string matching approach, and I don't see a more compelling option. But if we can come up with an approach that enforces these constraints in the data model, that would also be very interesting to consider.

#17 Updated by dkliban@redhat.com about 2 months ago

I think that python does have a library0 that normalizes paths. Is it not sufficient?

Linux filesystems are case sensitive so we don't need to worry about case normalization.

I have not found a way to get the database to perform this kind of constraint for us.

[0] https://docs.python.org/3/library/os.path.html#os.path.normpath

#18 Updated by mhrivnak about 2 months ago

os.path is specifically for working with filesystem paths, which do behave a bit differently from URI paths. For example, URI paths can include percent-encoded data that is case-insensitive, while the rest of the path remains case sensitive.

Also, the behavior you get from os.path is platform-dependent. That's generally good when you're working with filesystem paths, because you want that tied to the behavior of your os (hence the package name of course). It's not so good when you want to work with RFC 3986 paths. We could just use "posixpath.normpath()" to get the same behavior reliably, but we would still be limited by the fact that it's not designed for URI paths.

We could decide to store the paths fully-unquoted, which eliminates the case sensitivity problem, and then still use posixpath for separators and dot-segments. Something like this might be sufficient for our normalization, but needs careful review to make sure we covered everything:

import posixpath
import urllib.parse

def normalize(input):
  return posixpath.normpath(urllib.parse.unquote(input)) + '/'

#19 Updated by dkliban@redhat.com about 2 months ago

@mhrivnak, i like that implementation <insert a thumbs up emoji>

#20 Updated by amacdona@redhat.com about 2 months ago

  • Subject changed from The Distribution ViewSet needs to previent base_path overlap. to The Distribution ViewSet needs to prevent base_path overlap.

#21 Updated by dkliban@redhat.com about 2 months ago

Just saw the Django docs0 about the index we need for performing LIKE querries on base_path.

[0] https://docs.djangoproject.com/en/1.11/ref/databases/#indexes-for-varchar-and-text-columns

#22 Updated by mhrivnak about 2 months ago

  • Related to Task #3051: Prevent Distribution base_path overlap in the data model added

#23 Updated by mhrivnak about 2 months ago

I filed #3051 as a way to solve this with enforcement at the database level. It seemed worth tracking as a separate redmine issue, and then we can choose which to implement.

Please register to edit this issue

Also available in: Atom PDF