Issue #8352
closed
Possible race condition in reserved resources
Description
('task_href: %s, has failed with error %f', '/pulp/api/v3/tasks/111ef25e-5798-4bdf-b619-ad120ea31ebb/',
{'traceback': ' File "/opt/bats/lib/python3.8/site-packages/rq/worker.py", line 982, in perform_job
self.handle_job_success(job=job,
File "/opt/bats/lib/python3.8/site-packages/pulpcore/tasking/worker.py", line 143, in handle_job_success
task.release_resources()
File "/opt/bats/lib/python3.8/site-packages/pulpcore/app/models/task.py", line 429, in release_resources
reservation.delete()
File "/opt/bats/lib/python3.8/site-packages/django_lifecycle/mixins.py", line 141, in delete
super().delete(*args, **kwargs)
File "/opt/bats/lib/python3.8/site-packages/django/db/models/base.py", line 922, in delete
return collector.delete()
File "/opt/bats/lib/python3.8/site-packages/django/db/models/deletion.py", line 317, in delete
signals.post_delete.send(
File "/opt/bats/lib/python3.8/site-packages/django/db/transaction.py", line 240, in __exit__
connection.commit()
File "/opt/bats/lib/python3.8/site-packages/django/db/backends/base/base.py", line 262, in commit
self._commit()
File "/opt/bats/lib/python3.8/site-packages/django/db/backends/base/base.py", line 240, in _commit
return self.connection.commit()
File "/opt/bats/lib/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/opt/bats/lib/python3.8/site-packages/django/db/backends/base/base.py", line 240, in _commit
return self.connection.commit()',
'description': 'update or delete on table "core_reservedresource" violates foreign key constraint "core_taskreservedres_resource_id_ee0b7c62_fk_core_rese" on table "core_taskreservedresource"
DETAIL: Key (pulp_id)=(f680438a-966e-46bb-9a40-5e761237ad1e) is still referenced from table "core_taskreservedresource".
'})
Updated by mdellweg over 3 years ago
It looks like the task cleanup code checks whether any task is still attached to a reserved_resource before deleting it. That leaves a small window in which another task can attach to that reserved_resource, causing the delete to fail with a foreign key constraint violation.
The safe way (I think) is to try to delete it and ignore the constraint exception.
https://github.com/pulp/pulpcore/blob/master/pulpcore/app/models/task.py#L422
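The "try to delete and ignore the constraint error" idea can be sketched roughly as follows. This is only an illustration using the standard-library sqlite3 module rather than Pulp's Django models; the table names, the release_resource helper, and the schema are hypothetical stand-ins, not the actual pulpcore code.

```python
import sqlite3


def release_resource(conn, resource_id):
    """Try to delete a reserved resource; return False if it is still referenced.

    Hypothetical helper: instead of checking for attached tasks first (which
    leaves a race window), attempt the delete and let the database decide.
    """
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "DELETE FROM reserved_resource WHERE id = ?", (resource_id,)
            )
        return True
    except sqlite3.IntegrityError:
        # Another task still references the resource; leave it in place.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE reserved_resource (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE task_reserved_resource ("
    " id INTEGER PRIMARY KEY,"
    " resource_id INTEGER NOT NULL REFERENCES reserved_resource(id))"
)
conn.execute("INSERT INTO reserved_resource (id) VALUES (1), (2)")
conn.execute("INSERT INTO task_reserved_resource (resource_id) VALUES (1)")
conn.commit()

referenced_deleted = release_resource(conn, 1)    # resource 1 is still in use
unreferenced_deleted = release_resource(conn, 2)  # resource 2 has no references
```

With Django models, the analogous exception would be django.db.IntegrityError; whether silently ignoring it is appropriate depends on what else the cleanup code expects.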
Updated by osapryki over 3 years ago
The root cause of the problem is that the Pulp worker cleanup process is not atomic. Worker cleanup is executed by workers in a concurrent environment. After a worker queries for the worker record [1] and related resources [2] [3], and before the worker is marked as cleaned up [4], the resource manager process may insert new records. Those records are neither queried nor deleted by the time the worker cleanup process finishes.
The proposed solution is to use database locks on the worker record, so that the resource manager cannot assign new resources to a worker while the cleanup process is running.
Implementation details:
- Worker cleanup process must run in a transaction.
- Querying a worker for cleanup must use a FOR UPDATE row level lock.
- Resource manager resource reservation must run in a transaction.
- Resource manager must also use a FOR UPDATE lock on a worker.
When the resource manager queries a worker with a FOR UPDATE lock and the worker cleanup process has already started, the resource manager will wait until the worker cleanup process finishes.
The resource manager should lock only the worker to which it assigns the task and resources.
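The locking protocol described above can be illustrated with a small in-process analogy. This is not the pulpcore implementation: a threading.Lock per worker stands in for the FOR UPDATE row lock, and the Worker class, cleanup_worker, and reserve_resource are hypothetical names. The assumption that a reservation is refused once the worker is marked as cleaned up is also mine. In Django, the corresponding database-level tools would be select_for_update() inside transaction.atomic().

```python
import threading


class Worker:
    """Hypothetical stand-in for a worker record and its reservations."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()  # plays the role of the FOR UPDATE row lock
        self.cleaned_up = False
        self.reservations = []


def cleanup_worker(worker):
    """Cleanup side: hold the worker lock for the whole cleanup 'transaction'."""
    with worker.lock:
        worker.reservations.clear()
        worker.cleaned_up = True


def reserve_resource(worker, resource):
    """Resource manager side: lock only the worker being assigned to."""
    with worker.lock:  # waits here if a cleanup of this worker is in progress
        if worker.cleaned_up:
            return False  # assumption: never reserve on a cleaned-up worker
        worker.reservations.append(resource)
        return True


worker = Worker("worker-1")
first_attempt = reserve_resource(worker, "repo-a")

# Run the cleanup concurrently with a burst of reservation attempts.
threads = [threading.Thread(target=cleanup_worker, args=(worker,))]
threads += [
    threading.Thread(target=reserve_resource, args=(worker, f"repo-{i}"))
    for i in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

late_attempt = reserve_resource(worker, "repo-z")
```

Because both sides take the same per-worker lock, a reservation either completes before the cleanup starts (and is then removed by it) or observes the cleaned-up flag afterwards, so no reservation is left behind.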
References:
1. https://github.com/pulp/pulpcore/blob/3d3a7849fc7e7b4489e664f4a93694044aae8404/pulpcore/tasking/worker_watcher.py#L179
2. https://github.com/pulp/pulpcore/blob/3d3a7849fc7e7b4489e664f4a93694044aae8404/pulpcore/tasking/worker_watcher.py#L184
3. https://github.com/pulp/pulpcore/blob/3d3a7849fc7e7b4489e664f4a93694044aae8404/pulpcore/tasking/worker_watcher.py#L184
4. https://github.com/pulp/pulpcore/blob/3d3a7849fc7e7b4489e664f4a93694044aae8404/pulpcore/tasking/worker_watcher.py#L195
Alternative approaches: An alternative solution is to use table level locks on the task and resource tables.
Updated by pulpbot over 3 years ago
- Status changed from NEW to POST
Added by dalley over 3 years ago
Updated by dalley over 3 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulpcore|289c1cdeda5ccb6c4d5208de0e74b29ee5eef445.
Updated by pulpbot over 3 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Fix race condition in handling of reserved resources
closes: #8352 https://pulp.plan.io/issues/8352