Project

Profile

Help

Task #3051

closed

Prevent Distribution base_path overlap in the data model

Added by mhrivnak over 6 years ago. Updated almost 5 years ago.

Status:
CLOSED - DUPLICATE
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
Start date:
Due date:
% Done:

0%

Estimated time:
Platform Release:
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Quarter:

Description

This solves the same problem as issue #2987, but it does enforcement at the data model level. It is more complex than the string-parsing proposed in #2987, but enforcing the schema in the database, instead of in python, is a big advantage.

I call this the “Jerry Springer” model. You’ll see why...

I am not recommending this solution over the one proposed in #2987, but it is worth considering. At this early stage of Pulp 3 development, it could make sense to proceed with the quicker solution. At a minimum, this approach could be used as a future replacement for the implementation of #2987.

This solution approaches the issue like a filesystem. To manage the integrity of a filesystem tree, you have to independently represent and track each node in the tree.

We will track each segment of a path as a node and store the full tree in the database. This lets us identify for sure which nodes are leaf nodes. Then the challenge is to enforce that only leaf nodes can have a distribution.

Consider for the moment that we’ll track each node’s “path” and “parent”. For example “a/b/c/”, and a reference to the node for path “a/b/”.

We can accomplish enforcement with two constraints:

  • A node’s path must be unique.
  • A node cannot be a parent and have a distribution. This is another way of expressing “only leaf nodes can have a distribution”.

Of course the challenge with the second constraint is in figuring out if a given node is a parent (Now you’re understanding the name… [0]). On a filesystem, a directory stores references to its children. But in a relational DB, it’s much more desirable to have a FK on the child that references its parent. Given that, is there anything we can do on a node, without querying other nodes, to test whether it is a parent? Yes!

We can add an extra field to the model that is a unique ID, in addition to the PK, and use that as the “to_field” on the ForeignKey field. Let’s call it “willing_parent_id” to signify that a node is willing to be a parent if the field is populated. If that field is null, we are guaranteed that there are no children. That’s half the battle.

The second half is enforcing that either “willing_parent_id” or “distribution” must be null. We can do that at the DB level using the “constraints” feature of our relational database. Django doesn’t fully expose that, but we can use the technique described here:

https://www.fusionbox.com/blog/detail/custom-database-constraints-in-django/594/

The constraint would be approximately:

ALTER TABLE distributed_path ADD CONSTRAINT has_distribution_or_children CHECK (distribution_id IS NULL OR willing_parent_id IS NULL)

The model would look approximately like this (completely untested) code:

class DistributedPath(models.Model):
    path = models.CharField(unique=True)
    willing_parent_id = models.UUIDField(unique=True, null=True, default=uuid.uuid4)
    parent = models.ForeignKey('self', null=True, to_field='willing_parent_id', on_delete=models.PROTECT)
    distribution = models.ForeignKey(Distribution, null=True)

[0] For those not familiar, Jerry Springer is famous in the US for his trashy TV show that's well-known for revealing paternity test results on air.


Related issues

Related to Pulp - Task #2987: The Distribution ViewSet needs to prevent base_path overlap.CLOSED - CURRENTRELEASEdaviddavis

Actions
Related to Pulp - Task #3448: Warn users that distributor base paths should not overlapCLOSED - CURRENTRELEASEdaviddavis

Actions
Related to Pulp - Issue #3449: Requesting content from nested distribution path results in 500 errorCLOSED - WONTFIXActions
Related to Pulp - Story #3044: Distribution create/update operations should be asynchronousCLOSED - CURRENTRELEASECodeHeeler

Actions
Actions #1

Updated by mhrivnak over 6 years ago

  • Related to Task #2987: The Distribution ViewSet needs to prevent base_path overlap. added
Actions #2

Updated by mhrivnak over 6 years ago

  • Description updated (diff)
Actions #3

Updated by jortel@redhat.com over 6 years ago

mhrivnak wrote:

We can add an extra field to the model that is a unique ID, in addition to the PK, and use that as the “to_field” on the ForeignKey field. Let’s call it “willing_parent_id” to signify that a node is willing to be a parent if the field is populated. If that field is null, we are guaranteed that there are no children. That’s half the battle.

The second half is enforcing that either “willing_parent_id” or “distribution” must be null. We can do that at the DB level using the “constraints” feature of our relational database. Django doesn’t fully expose that, but we can use the technique described here:

How is willing_parent_id managed?

Actions #4

Updated by bmbouter over 6 years ago

There is a design that was suggested in #postgresql that I wanted to at least share. It doesn't use constraints so that is not as great, but it is easy to maintain and should be pretty fast because the foreign keys can all be indexed.

class DistributionPath(models.Model):
    path_segment = models.CharField()
    parent = models.ForeignKey('self', null=True, to_field='id')
    distribution = models.ForeignKey(Distribution, null=True)

Then you would run it by searching through the tree to see if there is one that you cannot find one that corresponds with your path segment so far. If you can't find one then it's safe to add. If you get to the end of the segments and you found one all along the way it is not safe to add.

segments = ['a', 'b', 'c']  # this is /a/b/c/
qs = DistributionPath.objects
for index, segment in enumerate(segments):
    try:
        node = qs.get(path_segment=segment)
    except DoesNotExist:
        break  # We know this path is safe
    else:
        qs = DistributionPath.objects.filter(parent=node)
        if index == len(segments):
            raise Exception('this path is not safe')

Adding a Distribution causes these records to be added, probably along with the search. Removing is a similar process. All of this should happen in one table-level transaction which should make this race condition free. This would linearize concurrent distribution additions/removals at the db level.

This kind of a structure could also provide a more efficient lookup of the base paths probably also. That is the purpose of the 'distribution' foreign key.

Actions #5

Updated by daviddavis over 6 years ago

  • Tags Pulp 3 MVP added
Actions #6

Updated by amacdona@redhat.com about 6 years ago

Added checklist item to remove warning from https://pulp.plan.io/issues/3448

Actions #7

Updated by amacdona@redhat.com about 6 years ago

  • Related to Task #3448: Warn users that distributor base paths should not overlap added
Actions #8

Updated by daviddavis about 6 years ago

  • Related to Issue #3449: Requesting content from nested distribution path results in 500 error added
Actions #9

Updated by bmbouter about 6 years ago

I wanted to write an idea for an algorithmic solution here. It would use a model like this one (totally untested):

class BasePath(Model):
    parent_path = ForeignKey('BasePath', null=True, backref='subpaths')
    distribution = ForeignKey('Distribution', null=True)
    subpath = CharField(max_length=128)
    distribution_count = IntegerField(default=1)

Then have an algorithm similar to this loose code:

def add_path(path, distribution):
    """
    Takes the path as a string and the models.Distribution instance this path is for.
    """
    list_of_path_subcomponents = os.split(os.sep, path)
    # start db transaction
    parent = None
    for i, subpath in enumerate(list_of_path_subcomponents):
        try:
            parent = BasePath.objects.filter(parent=parent, subpath=subpath).get()
        except DoesNotExist:
            distribution_kwargs = {}
            if i == len(list_of_path_subcomponents) - 1:
                # The next BasePath needs to have the Distribution set so that the data model is correct
                distribution_kwargs['distribution'] = distribution
            parent = BasePath.objects.create(parent=parent, subpath=subpath, **distribution_kwargs)
        continue:
            if parent.distribution != None:
                # This path is already in use by a distribution so raise an exception
                raise Exception('The path overlaps! Revert all changes by raising an exception')
            if i == len(list_of_path_subcomponents) - 1:
                # This is expected to be the distribution for this path, but it's already in use. Raise an exception
                raise Exception('The path overlaps! Revert all changes by raising an exception')
            parent.distribution_count = parent.distribution_count + 1
            parent.save()
    # commit db transaction
def remove_path(path):
    """
    Takes the path as a string to be removed
    """
    list_of_path_subcomponents = os.split(os.sep, path)
    # start db transaction
    parent = None
    for i, subpath in enumerate(list_of_path_subcomponents):
        parent = BasePath.objects.filter(parent=parent, subpath=subpath).get()
        if parent.distribution_count == 1:
           parent.delete()
           parent = None  # This causes the next loop to find those entries whose ForeignKeys were set to None when the delete() ran above
        else:
            parent.distribution_count = parent.distribution_count - 1
            parent.distribution_count.save()
    # commit db transaction

The idea is that this would run quickly due to the indexed foreign keys that Django will create. It really restricts the number of records searched at each level of the search. I don't see how we can not search each level for overlap independently so searching efficiently level-by-level is as good as we can do.

Also the transactions make adding easy because you can modify records until you find a problem and then bail with an exception, and the record writes are all ignored which is what you want. Similarly for removing a base path. Say you have a path that is partially right in the first parts, but not accurate in the later parts. You can start removing it in the transaction, and then when you later learn it's invalid you can bail with the exception.

The transaction also ensures safety even with concurrency. In practice I think these transactions will be held for a very small amount of time.

I'm thinking this would not be used by the content app, but would allow us to facilitate this requirement using the database and the algorithm. I'm not sure what the content app looks like currently in terms of how it resolves the base_path and how this is similar (or not) to that.

Actions #10

Updated by amacdona@redhat.com about 6 years ago

  • Tags deleted (Pulp 3 MVP)
Actions #11

Updated by daviddavis about 6 years ago

Another option is to use the tasking system. Currently tasks lock on the 'resource' field[0] which is just a string (e.g. /api/v3/distributions/abc123/). The reservation system relies on the uniqueness constraint of this field to prevent two tasks from reserving the same resource. We could lock on all distributions by using /api/v3/distributions/ when we create/update distributions. We would have to be careful though not to lock on individual distributions in other places in the code.

[0] https://git.io/vpIEq

Actions #12

Updated by amacdona@redhat.com about 6 years ago

I really hope that we don't do a task lock for distribution updates.

One of the big values for the current non-locking distribution update design was that "promotion" would be instant. IMO, this feature is really important, it might even outweigh the relatively unlikely race condition.

Actions #13

Updated by amacdona@redhat.com about 6 years ago

To elaborate, I'll tell a user story.

The user has a "testing" and "production" distribution, and "testing" is currently serving a newer publication than "production". The tests all pass, and they set "production" to distribute the newer publication. Unfortunately, their tests missed something, and they need to roll back. The situation is made even worse, Pulp is currently doing the regular syncs of many repositories at once and all the workers are busy.

Either:
The user makes a synchronous call, rolling back "production" to the old publication. Solved! The admin can go home.
of
The user makes an asynchronous call, but it is low on the task queue. "production" is still serving the incorrect content until the task queue dies down, so the user cancels all the current tasks out of desperation. The distribution is updated, whew! Then the user queues up all the tasks they canceled and goes home.

Actions #14

Updated by daviddavis about 6 years ago

@asmacdo, I mention the task option because it requires very little change to our code today—perhaps only a few lines. All these other solutions are complex and require db changes that can be hard to grok. A synchronous call to updating base_path on distribution is better for users for sure.

Actions #15

Updated by amacdona@redhat.com over 5 years ago

  • Related to Story #3044: Distribution create/update operations should be asynchronous added
Actions #16

Updated by amacdona@redhat.com over 5 years ago

  • Tags Pulp 3 RC Blocker added
Actions #17

Updated by jortel@redhat.com over 5 years ago

  • Status changed from NEW to CLOSED - DUPLICATE

This will be resolved by #3044.

Actions #18

Updated by daviddavis almost 5 years ago

  • Sprint/Milestone set to 3.0.0
Actions #19

Updated by bmbouter almost 5 years ago

  • Tags deleted (Pulp 3, Pulp 3 RC Blocker)

Also available in: Atom PDF