Task #2987: The Distribution ViewSet needs to prevent base_path overlap. - Pulp

Actions

Send by e-mail Copy link

Task #2987

closed

The Distribution ViewSet needs to prevent base_path overlap.

Added by jortel@redhat.com over 7 years ago. Updated about 5 years ago.

Status:

CLOSED - CURRENTRELEASE

Priority:

Normal

Assignee:

Category:

-

Sprint/Milestone:

Start date:

Due date:

% Done:

100%

Estimated time:

Platform Release:

Groomed:

No

Sprint Candidate:

Yes

Tags:

Sprint:

Sprint 35

Quarter:

Description

The Distribution ViewSet needs to previent base_path overlap. The model (DB) ensures base_path uniqueness but does not prevent one base_path from being nested within another. The look-and-see logic in the pulp2 check for this is subject to race conditions. This needs to be solved using the DB.

For example:

base_path = a/b/c/d needs to reserve the entire directory tree. Another base_path cannot begin with a/b/c/d.

Related issues

Actions

#1

Updated by jortel@redhat.com over 7 years ago

Description updated (diff)

Actions

#2

Updated by jortel@redhat.com over 7 years ago

Subject changed from The Distribution serializer needs to previent base_path overlap. to The Distribution ViewSet needs to previent base_path overlap.
Description updated (diff)

Actions

#3

Updated by amacdona@redhat.com over 7 years ago

Groomed changed from No to Yes
Sprint Candidate changed from No to Yes

Actions

#4

Updated by mhrivnak over 7 years ago

Sprint/Milestone set to 45

Actions

#5

Updated by jortel@redhat.com over 7 years ago

Groomed changed from Yes to No

Ungrooming and removing checklist items per discussion in retrospective.

Actions

#6

Updated by jortel@redhat.com over 7 years ago

Sprint/Milestone deleted (45)

Let's document and discuss potential solutions here.

Actions

#7

Updated by mhrivnak over 7 years ago

Thinking of this as two problems, I have a pattern to solve one of them. Maybe it'll inspire someone to come up with another idea that solves both. The problems are:

1. make sure the new base_path is not contained by some existing base_path
2. make sure the new base_path does not contain some existing base_path

Problem 1 is very similar to a problem faced by the app that serves content. Given a full URL, it needs to determine which distribution's base_path is somewhere in that URL.

Given a candidate new base_path or a full path to content, for example "a/b/c/d", you could split that into each of these paths:

a
a/b
a/b/c
a/b/c/d

And you could do a query for any Distribution where the base_path is in that list of potential paths.

For use case 1, if a Distribution is found, then the new base_path is not allowed. For the content serving use case, if a Distribution is found, then that is the one which should be further queried to find the requested file.

But as you can see, this does nothing for use case 2.

Actions

#8

Updated by jortel@redhat.com over 7 years ago

An additional consideration is race conditions which can be solved in a few ways.

1. The check is enforced with constraints so the DB will prevent race conditions. Not sure how this can be done yet.
2. The check is enforced with queries (as done in pulp2 and suggested in #note-7) in which case a common approach is to lock the table. Best I can tell, explicit table locking is not directly supported by django in a DB agnostic way. However this can be solved easily if we're willing to do a little bit of postgres specific SQL.

As for case 2, there won't be many Distributions and I think we could fetch them all (just base_url) into memory and apply the case 1 algorithm in reverse.

Actions

#9

Updated by bmbouter over 7 years ago

Can some example cases that we want to prevent be written out? Having some common examples would be helpful I think in evaluating algorithms.

Actions

#10

Updated by dkliban@redhat.com over 7 years ago

mhrivnak, I think you meant that searching for sub-paths of a path would help with use case 2. For use case number 1, a simple LIKE with a left-anchored pattern would be sufficient. An index is used for such a lookup in both PostgreSQL and MariaDB. Continuing with your example, the query would look like this:

SELECT * FROM distributions
WHERE base_path LIKE 'a/b/c/d%'

Actions

#11

Updated by amacdona@redhat.com over 7 years ago

bmbouter, I'm not sure if you meant realistic use cases, here is the abstract, general form.

d1.base_path = a/b/c/d/

When d2 is created, the folowing base_paths should violate this constraint:

a, a/b/, a/b/c/, and a/b/c/d/ because d2 could pollute d1.base_path. (mhrivnak's case 1)
`a/b/c/d/*` because d1` could pollute d2.base_path (mhrivnak's case 2)

Note: d2.base_path = a/b/not_c/d/ is still legal, even though both base_paths have reserved a/ and a/b/

Brainstorming:

When a base_path is saved, each path that could pollute this base_path can thought of as reserved.
A base_path cannot contain the strings that match ^any_other_base_path.+$
A base_path cannot be in all_reserved_paths
A new valid base path can contain ^any_reserved_path.+$

Since we can't treat reserved_paths the same as base_paths, I think we should consider a separate table for reserved_paths. With 2 tables, (and BasePath inherits from RelativePath) I'm hoping/speculating that we could validate with a query like this:

def validate_base_path:
    reserved_collisions = models.RelativePaths.objects(path=base_path)
    nested_collisions = models.BasePaths.objects(path__startswith=base_path)
    assert not reserved_collisions.exists() and not nested_collisions.exists()

Actions

#12

Updated by mhrivnak over 7 years ago

This covers the most prominent cases I can think of. It includes 1 and 2 from https://pulp.plan.io/issues/2987#note-7, plus it includes a potential string matching gotcha.

Candidate	Existing	Is OK
a/b/c	a/b/d	Y
a/b/c	a/b/charlie	Y
a/b/c	a/b	N
a/b/c	a/b/c	N
a/b/c	a/b/c/d	N

Actions

#13

Updated by mhrivnak over 7 years ago

To pile on one more way of thinking about it, since I'm weary of doing string comparison or parsing while honoring path semantics (though I wouldn't say it's out of the question), I think of this in terms of path segments organized in a tree.

Each node is a path segment.
Each non-root node has one parent and 0 to many children.
A Distribution can only exist at a leaf node.
Every leaf node contains a Distribution.

Actions

#14

Updated by amacdona@redhat.com over 7 years ago

mhrivnak wrote:

This covers the most prominent cases I can think of. It includes 1 and 2 from https://pulp.plan.io/issues/2987#note-7, plus it includes a potential string matching gotcha.

Candidate Existing Is OK

a/b/c a/b/d Y

a/b/c a/b/charlie Y

a/b/c a/b N

a/b/c a/b/c N

a/b/c a/b/c/d N

We can avoid the string match gotcha if each path segment always includes its trailing slash.

Actions

#15

Updated by dkliban@redhat.com over 7 years ago

mhrivnak, why are you opposed to the string matching? PostgreSQL can use indexes[0] to do this.

[0] https://www.postgresql.org/docs/9.5/static/indexes-opclass.html

Actions

#16

Updated by mhrivnak over 7 years ago

I'm not opposed. String parsing the whole path might be our best option. It's just not often the best starting point for a tree traversal problem, and there are some potential gotchas we need to be careful with. So I gravitate toward representing the hierarchy more concretely, and as jortel was describing, it would be ideal to enforce data validation within the data model itself. But I'm very open to any solution that gets the job done.

If we normalize the paths (I think we have to regardless of how we store the path data), and if we ensure the paths always end in a trailing slash (probably safe but we need to enforce it in code), and if the database can do such searches quickly, then I think it's a viable option.

Normalization might be a little tricky, although we have to do it regardless. As long as we normalize paths before saving them to the DB and normalize the "candidate" path being used in each query, we should be ok. Here are some normalization details we have to keep in mind:

https://tools.ietf.org/html/rfc3986#page-40

Surprisingly, it looks like python does not have a standard library function to do this normalization. :( The big items are case normalization and collapsing any dot segments or multiple slashes. Maybe django has something built in that can do this for us.

So given what we have so far, I like the simplicity of the string matching approach, and I don't see a more compelling option. But if we can come up with an approach that enforces these constraints in the data model, that would also be very interesting to consider.

Actions

#17

Updated by dkliban@redhat.com over 7 years ago

I think that python does have a library[0] that normalizes paths. Is it not sufficient?

Linux filesystems are case sensitive so we don't need to worry about case normalization.

I have not found a way to get the database to perform this kind of constraint for us.

[0] https://docs.python.org/3/library/os.path.html#os.path.normpath

Actions

#18

Updated by mhrivnak over 7 years ago

os.path is specifically for working with filesystem paths, which do behave a bit differently from URI paths. For example, URI paths can include percent-encoded data that is case-insensitive, while the rest of the path remains case sensitive.

Also, the behavior you get from os.path is platform-dependent. That's generally good when you're working with filesystem paths, because you want that tied to the behavior of your os (hence the package name of course). It's not so good when you want to work with RFC 3986 paths. We could just use "posixpath.normpath()" to get the same behavior reliably, but we would still be limited by the fact that it's not designed for URI paths.

We could decide to store the paths fully-unquoted, which eliminates the case sensitivity problem, and then still use posixpath for separators and dot-segments. Something like this might be sufficient for our normalization, but needs careful review to make sure we covered everything:

import posixpath
import urllib.parse

def normalize(input):
  return posixpath.normpath(urllib.parse.unquote(input)) + '/'

Actions

#19

Updated by dkliban@redhat.com over 7 years ago

mhrivnak, i like that implementation <insert a thumbs up emoji>

Actions

#20

Updated by amacdona@redhat.com over 7 years ago

Subject changed from The Distribution ViewSet needs to previent base_path overlap. to The Distribution ViewSet needs to prevent base_path overlap.

Actions

#21

Updated by dkliban@redhat.com over 7 years ago

Just saw the Django docs[0] about the index we need for performing LIKE querries on base_path.

[0] https://docs.djangoproject.com/en/1.11/ref/databases/#indexes-for-varchar-and-text-columns

Actions

#22

Updated by mhrivnak over 7 years ago

Related to Task #3051: Prevent Distribution base_path overlap in the data model added

Actions

#23

Updated by mhrivnak over 7 years ago

I filed #3051 as a way to solve this with enforcement at the database level. It seemed worth tracking as a separate redmine issue, and then we can choose which to implement.

Actions

#24

Updated by daviddavis about 7 years ago

Tags Pulp 3 MVP added
Tags deleted (~~Pulp 3~~)

Actions

#25

Updated by bmbouter about 7 years ago

I wanted to write an idea for an algorithmic solution here. It would use a model like this one (totally untested):

class BasePath(Model):
    parent_path = ForeignKey('BasePath', null=True, backref='subpaths')
    distribution = ForeignKey('Distribution', null=True)
    subpath = CharField(max_length=128)
    distribution_count = IntegerField(default=1)

Then have an algorithm similar to this loose code:

def add_path(path, distribution):
    """
    Takes the path as a string and the models.Distribution instance this path is for.
    """
    list_of_path_subcomponents = os.split(os.sep, path)
    # start db transaction
    parent = None
    for i, subpath in enumerate(list_of_path_subcomponents):
        try:
            parent = BasePath.objects.filter(parent=parent, subpath=subpath).get()
        except DoesNotExist:
            distribution_kwargs = {}
            if i == len(list_of_path_subcomponents) - 1:
                # The next BasePath needs to have the Distribution set so that the data model is correct
                distribution_kwargs['distribution'] = distribution
            parent = BasePath.objects.create(parent=parent, subpath=subpath, **distribution_kwargs)
        continue:
            if parent.distribution != None:
                # This path is already in use by a distribution so raise an exception
                raise Exception('The path overlaps! Revert all changes by raising an exception')
            if i == len(list_of_path_subcomponents) - 1:
                # This is expected to be the distribution for this path, but it's already in use. Raise an exception
                raise Exception('The path overlaps! Revert all changes by raising an exception')
            parent.distribution_count = parent.distribution_count + 1
            parent.save()
    # commit db transaction

def remove_path(path):
    """
    Takes the path as a string to be removed
    """
    list_of_path_subcomponents = os.split(os.sep, path)
    # start db transaction
    parent = None
    for i, subpath in enumerate(list_of_path_subcomponents):
        parent = BasePath.objects.filter(parent=parent, subpath=subpath).get()
        if parent.distribution_count == 1:
           parent.delete()
           parent = None  # This causes the next loop to find those entries whose ForeignKeys were set to None when the delete() ran above
        else:
            parent.distribution_count = parent.distribution_count - 1
            parent.distribution_count.save()
    # commit db transaction

The idea is that this would run quickly due to the indexed foreign keys that Django will create. It really restricts the number of records searched at each level of the search. I don't see how we can not search each level for overlap independently so searching efficiently level-by-level is as good as we can do.

Also the transactions make adding easy because you can modify records until you find a problem and then bail with an exception, and the record writes are all ignored which is what you want. Similarly for removing a base path. Say you have a path that is partially right in the first parts, but not accurate in the later parts. You can start removing it in the transaction, and then when you later learn it's invalid you can bail with the exception.

The transaction also ensures safety even with concurrency. In practice I think these transactions will be held for a very small amount of time.

I'm thinking this would not be used by the content app, but would allow us to facilitate this requirement using the database and the algorithm. I'm not sure what the content app looks like currently in terms of how it resolves the base_path and how this is similar (or not) to that.

Actions

#26

Updated by daviddavis about 7 years ago

Tags Pulp 3 added

Actions

#27

Updated by jortel@redhat.com almost 7 years ago

I totally get the motivation behind leveraging the DB for this. It was my first thought as well. But, all of the algorithms suggested (so far) seem complicated and store data that is only needed to support this check in the DB. It makes the base_path in our data model complicated. I'm wondering if we should explore another great tool at our disposal - resource locking. The path overlapping check itself (like in pulp2) can be implemented using a straight forward algorithm that does not impact the data-model. The problem is, as we all know, is that it's vulnerable to race conditions which is how solving using the DB got started. So, what if we just add/update Distribution in a task that is resource locked on the Distribution and have the task implement the check?

Another option to consider for solving the concurrency problem is explicit table locking. This would be safe and simpler than resource locking (tasking) especially with a context manager. But, sadly, explicit table locking is not supported by django ORM which is a huge downside. The SQL is simple enough but not 100% standard and could limit our ability to be DB agnostic. Just mentioning as "food for thought".

Actions

#28

Updated by daviddavis almost 7 years ago

Status changed from NEW to POST
Assignee set to daviddavis

https://github.com/pulp/pulp/pull/3380

Actions

#29

Updated by bmbouter almost 7 years ago

I copied the design from comment 25 to ticket #3051 that intends to do this at the DB layer. Specifically it was moved here: https://pulp.plan.io/issues/3051#note-9

Actions

#30

Updated by daviddavis almost 7 years ago

Sprint set to Sprint 35

Consensus was to add this to the sprint during today's core beta check-in meeting.

Added by daviddavis almost 7 years ago

Revision 4fc05cf4 | View on GitHub

Add view layer logic to prevent overlapping distribution paths

fixes #2987 https://pulp.plan.io/issues/2987

Added by daviddavis almost 7 years ago

Revision 4fc05cf4 | View on GitHub

Add view layer logic to prevent overlapping distribution paths

fixes #2987 https://pulp.plan.io/issues/2987

Actions

#32

Updated by daviddavis almost 7 years ago

Status changed from POST to MODIFIED
% Done changed from 0 to 100

Applied in changeset pulp|4fc05cf4517f53f3877ed451c3dc66dc271c66cc.

Actions

#33

Updated by dkliban@redhat.com almost 7 years ago

Sprint/Milestone set to 3.0.0

Actions

#34

Updated by bmbouter over 5 years ago

Tags deleted (~~Pulp 3, Pulp 3 MVP~~)

Actions

#35

Updated by kersom over 5 years ago

Related to Test #4882: Add back tests related to overlapping base_path added

Actions

#36

Updated by bmbouter about 5 years ago

Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Actions

Send by e-mail Copy link

Also available in: Atom PDF