Story #3186
Closed. Parent: Story #3209: As a user, I have Repository Versions
Rewrite tasking system labels for versioned repositories
100%
Description
Motivation
Versioned repositories will have a significant impact on the locking system. Due to the bookkeeping approach, any time you modify content or a RepositoryVersion, you need to serialize those operations with other content manipulations on that repo.
Also, the decision to not associate an importer directly with a repo means that we need to lock differently when updating importer settings.
Finally, publication manipulation also needs to lock on repository versions.
Design
The existing tasking system prevents two tasks that have the same lock label from running at the same time. For correctness, I think we will need to introduce multi-resource locking: the ability to have one task require locks on N resources. This is easily added to the tasking system. The resources locked, and in which situations, are listed here:
- an update/delete importer task locks on the importer instance
- an update/delete publisher task locks on the publisher instance
- a sync task locks on the importer instance
- a sync task locks on the repository
- a publish task locks on the publisher instance
- a publish task locks on the repository
- a delete versioned repo task locks on the repository
A three-task example
Say a user updates an importer with a PUT to /api/v3/importers/file-importer/55. That would cause the update task to lock on just that importer: ["/api/v3/importers/file-importer/55"]
Then a user will POST to /api/v3/repository/foo/repo_version/ with {"importer": "/api/v3/importers/file-importer/55"}
That task would multi-lock on the importer, the repo version, and the repository: ["/api/v3/importers/file-importer/55", "/api/v3/repository/foo/repo_version/72", "/api/v3/repository/foo/"]. (Note that the viewset has to create the RepoVersion so it can know the detail view to lock on.)
Then a user will DELETE /api/v3/repository/foo/repo_version/72, which will cause locks on the repo version and the repository itself: ["/api/v3/repository/foo/repo_version/72", "/api/v3/repository/foo/"].
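For illustration only, here is a minimal sketch of how those three calls might be dispatched. The enqueue_with_reservation helper and the task names are hypothetical stand-ins, not the actual Pulp API:

```python
# Hypothetical sketch: the helper and task names are illustrative only.

def enqueue_with_reservation(task_name, resources):
    """Stand-in: hand `task_name` to the resource manager with the labels it must lock."""
    print(f"queue {task_name!r} with locks {resources}")

# 1. PUT /api/v3/importers/file-importer/55 -> lock just the importer
enqueue_with_reservation("update_importer", [
    "/api/v3/importers/file-importer/55",
])

# 2. POST /api/v3/repository/foo/repo_version/ -> lock the importer, the new
#    repo version, and the repository
enqueue_with_reservation("create_repo_version", [
    "/api/v3/importers/file-importer/55",
    "/api/v3/repository/foo/repo_version/72",
    "/api/v3/repository/foo/",
])

# 3. DELETE /api/v3/repository/foo/repo_version/72 -> lock the repo version and the repository
enqueue_with_reservation("delete_repo_version", [
    "/api/v3/repository/foo/repo_version/72",
    "/api/v3/repository/foo/",
])
```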
Example Run Order
Even on a Pulp system with 3+ workers, the above 3 tasks will be executed fully serially because each one requires a resource locked by the previous one.
Updated by jortel@redhat.com about 7 years ago
Thanks for the proposal, Brian!
Can you elaborate on the actual design? Specific changes to the tasking system?
In classic locking cases (I'm not sure how task locking works), acquiring locks in random order can result in deadlock. There are several ways to prevent/resolve deadlock. The most common (and simplest) is to acquire locks in a pre-determined order.
Can you explain how the proposed design addresses deadlock?
Updated by bmbouter about 7 years ago
The deadlocking is something we need to be sure of. I believe this will be deadlock-free because lock acquisition is done by a singleton, the resource manager, so there isn't any concurrency around lock acquisition. Deadlocking usually comes from two concurrent actors each locking multiple resources and acquiring them in different orders; since acquisition is serialized here, that should be avoided. This system does release locks concurrently, and having that happen in parallel is good and also won't cause a deadlock.
I'm going to add some checklist items to call out the work items more specifically and the design.
Updated by bmbouter about 7 years ago
Here is a bit about the design changes I see:
Updates to _queue_reserved_resource
The design needs to switch the resource_id parameter of _queue_reserved_resource to a list of locks instead of just one. This way the resource manager can understand the lock objects the task requires to complete.
The _queue_reserved_resource method needs to look for and create locks differently. The algorithm is similar to today's except with slightly more cases. Here is the whole algorithm (not a diff from the current one, which you can see here); a rough code sketch follows the list:
1. A worker that already holds all of the exact locks is available. This is the first-choice option today too, but with single-locking there is just one lock to query for.
2. If you can't find a worker from step 1, then you have to find the workers who hold existing reservations of the locks you need. If that set is larger than 1, then you wait by continuing in the loop, which already waits 0.25 seconds. If that set is of size 1, go to step 3.
3. If you find exactly 1 worker who holds a subset of your locks, then create a ReservedResource record for that worker for each lock. You need to do it for each lock, not just the ones you are "adding", because the design requires each lock of each task to receive a record.
4. None of your locks are held, so you are looking for an empty worker. If all workers are busy, then go around the loop again because it's most efficient to wait. If you do find one, then make N ReservedResource records for that task, as in step 3.
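A minimal, self-contained sketch of that selection loop, assuming per-worker lock state is just a dict of sets (the real implementation would query ReservedResource records in the database and keep one record per task per lock):

```python
import time

def pick_worker(needed_locks, held_locks_by_worker, poll=0.25):
    """Return a worker that may run a task needing `needed_locks`, blocking until one exists.

    `held_locks_by_worker` maps worker name -> set of lock labels it currently holds.
    """
    needed = set(needed_locks)
    while True:
        # 1. A worker already holds every lock we need.
        exact = [w for w, held in held_locks_by_worker.items() if needed <= held]
        if exact:
            return exact[0]

        # 2. Find the workers holding any of our locks.
        holders = {w for w, held in held_locks_by_worker.items() if held & needed}
        if len(holders) == 1:
            # 3. Exactly one worker holds some of our locks: record the rest against it too.
            worker = holders.pop()
            held_locks_by_worker[worker] |= needed
            return worker
        if not holders:
            # 4. Nobody holds our locks: use the first idle worker, if there is one.
            for worker, held in held_locks_by_worker.items():
                if not held:
                    held_locks_by_worker[worker] |= needed
                    return worker

        # Locks are spread across 2+ workers, or every worker is busy: wait and retry.
        time.sleep(poll)
```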
Release Resource Updates
Also, the _release_resource_task needs to be sure that it deletes all locks associated with a task, because with this change there could be more than one.
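As a rough sketch of what that implies, with ReservedResource rows modeled as plain tuples rather than the real schema:

```python
# Releasing must drop every lock the finished task reserved, not just the first one found.

def release_resources(task_id, reserved_resources):
    """Delete all lock records belonging to `task_id`."""
    reserved_resources[:] = [row for row in reserved_resources if row[0] != task_id]

rows = [
    ("task-1", "worker1", "/api/v3/importers/file-importer/55"),
    ("task-1", "worker1", "/api/v3/repository/foo/"),
    ("task-2", "worker2", "/api/v3/repository/bar/"),
]
release_resources("task-1", rows)
print(rows)  # only task-2's lock remains
```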
Task updates
The tasks need to be redone so that all the 'tags' in use are looked at. They need to use the labels in the ticket description, i.e. the update/delete REST call to an Importer and the locks the task it creates will have. We also need to look for other usages of resource_id in the codebase, because the locks in this ticket should be the only Pulp locking.
Updated by bmbouter about 7 years ago
- Description updated (diff)
I added an additional lock to cause repo versions to lock on the associated repository. If we adopt the bookkeeping format of @mhrivnak's PR, we will need to, because it is unsafe to have two repo versions being produced concurrently for the same repo.
Updated by bmbouter almost 7 years ago
- Description updated (diff)
Updated the description to handle cases where locking needs to happen on repositories also.
Updated by dkliban@redhat.com almost 7 years ago
Since the resource manager will be a single entity that locks and unlocks resources, I don't think a deadlock can occur.
Updated by daviddavis almost 7 years ago
- Groomed changed from No to Yes
The design looks good to me. One small item:
That task would multi-lock on the importer and the repo version: ["/api/v3/importers/file-importer/55", "/api/v3/repository/foo/repo_version/72"]
Then a user will DELETE /api/v3/repository/foo/repo_version/72 which will cause a lock of ["/api/v3/repository/foo/repo_version/72"] to be used.
I believe these don't match up with the design, in that they should also lock on the repo.
Updated by bmbouter almost 7 years ago
- Description updated (diff)
Thanks @daviddavis. I had not updated that section. Your observations are correct. I just updated those sections as suggested.
Updated by bmbouter almost 7 years ago
- Description updated (diff)
There are no locks on repo versions with the latest design iteration.
Updated by daviddavis almost 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
Updated by daviddavis almost 7 years ago
There is a bit of a challenge that I see now that I am coding this issue. Namely, we need a way to temporarily skip reserved tasks in the resource manager queue.
Currently, with regard to the reserved work in the resource manager queue, the following algorithm is performed by the _queue_reserved_task function[0]:
1. Find a worker with the same reservation as your task. If one matches, queue the task against the worker and go to the next queued task.
2. If there's not a worker with the reservation you need, just pick the first unreserved worker. Send the task to that worker and go to the next queued task.
3. If there's no free worker, sleep for 0.25 seconds and then go back to step 1
The problem with this is that it is blocking, which is OK in the case where you have only one reserved resource per task. We can't simply adapt this algorithm to serve the case where tasks have multiple reserved resources.
Take this example: suppose you have a task that needs Repo A and Importer A. Suppose worker 1 holds Repo A and worker 2 holds Importer A. The resource manager can't queue this task against worker 1 or 2 (neither has all of the needed locks), and it can't send this work to worker 3, which may be available. Instead, it sleeps, which is bad because there might be other tasks in the queue.
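To make that concrete, here is a toy illustration of the stuck scenario, with plain sets standing in for the real worker and ReservedResource models:

```python
held = {
    "worker1": {"repo-a"},       # holds Repo A
    "worker2": {"importer-a"},   # holds Importer A
    "worker3": set(),            # idle
}
needed = {"repo-a", "importer-a"}

exact_match = [w for w, locks in held.items() if needed <= locks]     # []
partial_holders = {w for w, locks in held.items() if locks & needed}  # {"worker1", "worker2"}

# No worker holds both locks, and the locks are split across two workers, so the
# task cannot be dispatched to worker1, worker2, or the idle worker3. Under the
# current blocking loop the resource manager just sleeps, stalling every task
# queued behind this one.
print(exact_match, partial_holders)
```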
Optimally, the resource manager should skip this task and then proceed to the next task. I think we need to find a way to requeue the task if the resource manager can't find a worker to execute it.
Interested to hear other thoughts on this though.
Updated by bmbouter almost 7 years ago
The non-optimal example you give is correct: that could happen with this design, and it is not optimal. We don't have much of an opportunity to handle this case better. Celery's queueing model runs tasks in the order they are in the queue, so I agree that "requeueing" is the only option to allow subsequent messages to be handled if the HEAD of the queue is blocked.
The problem is that requeueing puts the blocked task onto the end of the queue which will reorder that task relative to the other work the user dispatched. This leads to an ordering correctness problem. For example, say a sync is blocked due to an importer not being available and there is a publish right behind it. If the sync is requeued because it's blocked then the publish will run before the sync. There are all sorts of ordering issues with this type of thing. We could look into the database and try to be smart about these cases, but I'm cautious of such an approach. If someone wants to make a case for this go for it.
From this perspective, this inefficiency isn't actually due to the design itself; it's due to the additional first-come-first-served correctness requirements, so blocking and waiting is probably what we should do.
Ideas and discussion is welcome.
Updated by daviddavis almost 7 years ago
- Status changed from ASSIGNED to POST
Added by daviddavis almost 7 years ago
Revision ff0739e9 | View on GitHub
Support the locking of multiple resources
Updated by daviddavis almost 7 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|ff0739e9a9fe4d3c0762e498ae59ec756d02fb71.
Added by dalley almost 7 years ago
Revision b803e825 | View on GitHub
Support multi-resource locking
Fix sync support after it was broken by the addition of multi-resource locking.
Updated by bmbouter about 5 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE