Issue #7912
closed
The tasking system deadlocks when Redis loses its tasks
Description
It has been observed that Redis can "lose" a task from its queue. This is believed to be the situation in https://pulp.plan.io/issues/7907, for example.
If Redis loses the _release_resource task but the worker stays running, no cleanup mechanism will release the resource locks the worker acquired. In this situation the tasking system could back up indefinitely. With the bugfix of https://pulp.plan.io/issues/7907, restarting the worker will release the locks, but it would be better if a restart of processes was not required.
Simulating this experience
- Create a remote
- Call sync 5 times in a row (to cause them to all be dispatched to the same worker)
- Verify they are in the worker's queue with rq info
- Tell RQ to forget the tasks without telling Pulp by running rq empty <the_worker's_queue_name>. For example: rq empty 47640@pulp3-source-fedora32.localhost.example.com
Improving this situation
Have the resource manager perform a health check at the two sleep(0.25) statements. One here and the other here.
Those are the exact points where the resource manager cannot dispatch the next task because it's waiting for locks to be released, and if the failure situation is occurring, it will wait there indefinitely. So that's the correct (and efficient) place to check it.
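As a rough illustration of where the check would sit, here is a minimal sketch of a polling loop that runs a health check before each sleep. The names wait_for_locks, locks_are_free and health_check are hypothetical and are not taken from the Pulp code base; only the 0.25 second interval comes from this issue.

```python
import time


def wait_for_locks(locks_are_free, health_check, poll_interval=0.25, sleep=time.sleep):
    """Poll until locks_are_free() returns True.

    health_check() runs before every sleep, so a lost task can be detected
    (and cancelled) at the exact point where the loop would otherwise spin
    forever. All names here are hypothetical, for illustration only.
    Returns the number of times the loop had to sleep.
    """
    polls = 0
    while not locks_are_free():
        # Exceptions from the health check (e.g. Redis unreachable) are
        # deliberately not caught here.
        health_check()
        sleep(poll_interval)
        polls += 1
    return polls
```

The sleep function is a parameter only so that the loop can be exercised without real delays.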
Specifically, have the resource manager run a health check that does the following:
    for worker in all_the_online_workers:
        for task in all tasks assigned to that worker in incomplete states:
            ask RQ if it has that job ID in the worker's queue still
    # also do the same for the resource manager
    for task in all tasks that are not yet assigned to a worker:
        ask RQ if the resource manager queue has the job, by its rq_job_id
If all the jobs are in RQ then just continue to loop around in _queue_reserved_resource.
For any jobs that are missing, call cancel() on them.
If you can't connect to Redis or experience a fatal exception, just let it raise.
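The health check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation used in Pulp: find_lost_jobs, run_health_check, and the queued_job_ids and cancel callables are all hypothetical names. In a real deployment, queued_job_ids would presumably be backed by RQ (for example, by listing the job IDs of a queue), and cancel by Pulp's own task cancellation.

```python
from typing import Callable, Dict, Iterable, List


def find_lost_jobs(
    expected: Dict[str, Iterable[str]],
    queued_job_ids: Callable[[str], Iterable[str]],
) -> List[str]:
    """Return job IDs Pulp expects to be queued but that the queue no longer holds.

    expected maps a queue name (one per online worker, plus the resource
    manager queue) to the rq_job_id values of its incomplete tasks.
    queued_job_ids(queue_name) is a hypothetical lookup returning the job
    IDs currently present in that queue.
    """
    lost = []
    for queue_name, job_ids in expected.items():
        present = set(queued_job_ids(queue_name))
        lost.extend(job_id for job_id in job_ids if job_id not in present)
    return lost


def run_health_check(expected, queued_job_ids, cancel) -> List[str]:
    """Cancel every lost job and return the IDs that were cancelled.

    Exceptions (such as a failed Redis connection inside queued_job_ids)
    are not caught, so they propagate to the caller.
    """
    lost = find_lost_jobs(expected, queued_job_ids)
    for job_id in lost:
        cancel(job_id)
    return lost
```

One caveat with this sketch: RQ removes a job from its queue once a worker starts executing it, so the caller would likely need to leave currently running tasks out of expected, or they would be reported as lost.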
Related issues
Remove separate task for releasing resources
Move it to an after-task action
re: #7912 https://pulp.plan.io/issues/7912