Failed task did not properly clean up resource reservations
In Automation Hub, the task galaxy_ng.app.tasks.synclist.curate_synclist_repository failed due to a Redis failure. However, the resource reservation for that task remained in the database, blocking the entire tasking system (when the worker count is 1).
Number of workers: 1
After review, the failure scenario goes like this:
- The tasking code itself runs to completion
- RQ attempts to notify Redis that the task is completed (in the RQ registry)
- Interacting with Redis raised an exception at this line: https://github.com/rq/rq/blob/master/rq/worker.py#L932
- This fatal exception is raised and handled by Pulp's handle_job_failure handler implementation, which records the exception (which is how we know this) and also marks the task as failed
- Also, when Redis became unavailable it forgot the tasks it was storing in memory, including the _release_resources task that pairs with the now-failed task and is intended to release the locks
- The worker never died, so other lock cleanup processes never occurred
- Tasks back up, and eventually a sysadmin restarts the processes
- The cleanup code in mark_worker_offline is triggered, but since the task is already in the FAILED state, this line does not issue its cancellation, which would have released the locks
- The locks are never released
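The sequence above can be sketched as a minimal in-memory simulation. All names here (Task, locks, queued_followups) are illustrative stand-ins for the behavior described, not pulpcore's or RQ's actual API:

```python
# Minimal sketch of the failure sequence; names are illustrative only.

FAILED, COMPLETED, RUNNING = "failed", "completed", "running"

class Task:
    def __init__(self, name):
        self.name = name
        self.state = RUNNING

locks = {"repo-1"}                          # resource reservation held in the database
queued_followups = ["_release_resources"]   # held only in Redis memory

# 1-2. The task body runs to completion, but notifying Redis raises.
task = Task("curate_synclist_repository")
try:
    raise ConnectionError("Redis unavailable")
except ConnectionError:
    # 3-4. The failure handler records the exception and marks the task failed.
    task.state = FAILED

# 5. When Redis comes back, it has forgotten the queued _release_resources job.
queued_followups.clear()

# 6-8. Worker cleanup only cancels tasks that are NOT already in a final
# state, so the FAILED task is skipped and its locks are never released.
if task.state not in (FAILED, COMPLETED):
    locks.clear()

assert locks == {"repo-1"}   # the leak: the reservation survives cleanup
```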
Add code to mark_worker_offline that will ensure all locks for a worker being cleaned up are released, even if a task failed and its _release_resources was never delivered. This should occur after the cancellation step, and should cover all tasks in "completed" (final) states.
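A hedged sketch of what that change could look like: after cancelling unfinished tasks, unconditionally drop every reservation held by the worker being cleaned up. The Worker class, cancel helper, and TASK_FINAL_STATES tuple here are illustrative stand-ins, not pulpcore's exact models or API:

```python
# Sketch of the proposed mark_worker_offline change; names are illustrative.

TASK_FINAL_STATES = ("completed", "failed", "canceled")

class Worker:
    def __init__(self, name):
        self.name = name
        self.tasks = []          # tasks dispatched to this worker
        self.reservations = []   # resource locks held on its behalf

def cancel(task):
    task["state"] = "canceled"   # cancellation also releases that task's locks

def mark_worker_offline(worker):
    # Existing behavior: cancel only tasks that never reached a final state.
    for task in worker.tasks:
        if task["state"] not in TASK_FINAL_STATES:
            cancel(task)
    # Proposed addition: release locks even for tasks already in a final
    # state, covering the case where _release_resources was never delivered.
    worker.reservations.clear()

worker = Worker("resource-manager")
worker.tasks.append({"state": "failed"})   # the stuck task from the report
worker.reservations.append("repo-1")       # its leaked lock
mark_worker_offline(worker)
assert worker.reservations == []           # locks released despite FAILED state
```

The key design point is that the lock release no longer depends on a task's state or on Redis delivering a follow-up job; it is tied to the worker cleanup itself.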
Added by bmbouter about 3 years ago
Adds additional lock cleanup to worker cleanup
As another layer of safety to guard against lock cleanup not occurring when Redis fails to deliver the _release_resource task, ensure all locks are also cleaned up, even for tasks that are in their final states.