Issue #7907

Updated by bmbouter over 3 years ago

In Automation Hub, the task `galaxy_ng.app.tasks.synclist.curate_synclist_repository` failed because of a Redis failure. However, the resource reservation for that task remained in the database, blocking the entire tasking system (when the number of workers is 1).

 https://gist.github.com/cutwater/4ec7960f0eac2793ca17a78723dca75d 


 **Environment:** 

 ~~~ 
 pulpcore 3.7.1 

 pulp-ansible 0.4.3 

 galaxy-ng 1326eb5f1679880b68e05a48d4377def7c72a95b 
 ~~~ 

 **Workers number:** 1 

 ## Analysis 

 After review, the failure scenario goes like this: 
 1. The tasking code itself runs to completion 
 2. RQ attempts to notify Redis the task is completed (in the RQ registry) in its [`handle_job_success`](https://github.com/rq/rq/blob/master/rq/worker.py#L925). 
 3. Interacting with Redis raised an exception at this line: https://github.com/rq/rq/blob/master/rq/worker.py#L932 
 4. This fatal exception is raised and then handled by Pulp's [`handle_job_failure` handler implementation](https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/worker.py#L104-L120), which records the exception (this is how we know it happened) and also marks the task as failed 
 5. Also, when Redis became unavailable, it lost the tasks it was storing in memory, including the [`_release_resources`](https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/tasks.py#L145) task that pairs with the now-failed task and is intended to release the locks 
 6. The worker never died so other lock cleanup processes never occurred. 
 7. Tasks back up and eventually a sysadmin restarts the processes 
 8. The cleanup code in [`mark_worker_offline`](https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/services/worker_watcher.py#L133) is triggered, but since the task is already at FAILED, [this line](https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/services/worker_watcher.py#L163) does not issue its cancellation, which would release the locks 
 9. The locks are never released. 
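
The sequence above can be reproduced with a small in-memory model. This is only an illustrative sketch, not pulpcore code: `Task`, `Worker`, `cancel`, `mark_worker_offline_current`, and the `/repo/1` resource name are all hypothetical stand-ins, and the assumption is that worker cleanup only cancels tasks that are still in an incomplete state.

```python
from dataclasses import dataclass, field

# Hypothetical state names; the real constants live in pulpcore.
INCOMPLETE_STATES = {"waiting", "running"}


@dataclass
class Task:
    name: str
    state: str = "running"


@dataclass
class Worker:
    name: str
    tasks: list = field(default_factory=list)
    # Maps a reserved resource to the task holding the lock (hypothetical model).
    reservations: dict = field(default_factory=dict)


def cancel(worker, task):
    """Cancel a task and release the locks it holds."""
    task.state = "canceled"
    held = [res for res, owner in worker.reservations.items() if owner is task]
    for res in held:
        del worker.reservations[res]


def mark_worker_offline_current(worker):
    """Model of the cleanup described in step 8: only incomplete tasks are canceled."""
    for task in worker.tasks:
        if task.state in INCOMPLETE_STATES:
            cancel(worker, task)


task = Task("curate_synclist_repository")
worker = Worker("worker-1", tasks=[task], reservations={"/repo/1": task})

# Step 4: the failure handler marks the task as failed.
task.state = "failed"
# Step 5: the paired release job was lost with Redis, so nothing releases the lock.
# Steps 7-8: the worker is restarted and cleanup runs, but skips the FAILED task.
mark_worker_offline_current(worker)

print("leaked reservations:", sorted(worker.reservations))
```

In this model the reservation survives cleanup because the task is no longer in an incomplete state, which mirrors step 9.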

 ## Solution 

 Add code to [`mark_worker_offline`](https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/services/worker_watcher.py#L133) that ensures all locks for a worker being cleaned up are released, even if a task failed and its `_release_resources` was never delivered. This should occur after the cancellation for all tasks in "completed" states.
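
A minimal sketch of the idea, using a hypothetical in-memory model rather than pulpcore's actual models or API: `Task`, `Worker`, `cancel`, and this `mark_worker_offline` are illustrative names only. It assumes that cleanup first cancels whatever tasks it would normally cancel, then unconditionally releases every reservation still attached to the worker.

```python
from dataclasses import dataclass, field

# Hypothetical state names; the real constants live in pulpcore.
INCOMPLETE_STATES = {"waiting", "running"}


@dataclass
class Task:
    name: str
    state: str = "running"


@dataclass
class Worker:
    name: str
    online: bool = True
    tasks: list = field(default_factory=list)
    # Maps a reserved resource to the task holding the lock (hypothetical model).
    reservations: dict = field(default_factory=dict)


def cancel(worker, task):
    """Cancel a task and release the locks it holds."""
    task.state = "canceled"
    held = [res for res, owner in worker.reservations.items() if owner is task]
    for res in held:
        del worker.reservations[res]


def mark_worker_offline(worker):
    """Clean up a worker and return the resources released by the final sweep."""
    # Existing behavior in this model: cancel tasks that are still incomplete.
    for task in worker.tasks:
        if task.state in INCOMPLETE_STATES:
            cancel(worker, task)
    # Proposed addition: release any lock that is still held, regardless of
    # the owning task's state, since its release job may never have arrived.
    leftover = sorted(worker.reservations)
    worker.reservations.clear()
    worker.online = False
    return leftover


failed = Task("curate_synclist_repository", state="failed")
running = Task("some_other_task", state="running")
worker = Worker(
    "worker-1",
    tasks=[failed, running],
    reservations={"/repo/1": failed, "/repo/2": running},
)

swept = mark_worker_offline(worker)
print("released by final sweep:", swept)
```

Running the sweep after the cancellations means locks of incomplete tasks are still released through the normal cancellation path, and the sweep only picks up what that path missed. The failed task keeps its FAILED state; only its reservation is removed.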
