Issue #7907
closed
Failed task did not clean up properly resource reservations
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags:
Sprint: Sprint 86
Quarter:
Description
In Automation Hub, the task galaxy_ng.app.tasks.synclist.curate_synclist_repository failed due to a Redis failure. However, the resource reservation for that task remained in the database, blocking the entire tasking system (when the number of workers is 1).
https://gist.github.com/cutwater/4ec7960f0eac2793ca17a78723dca75d
Environment:
pulpcore 3.7.1
pulp-ansible 0.4.3
galaxy-ng 1326eb5f1679880b68e05a48d4377def7c72a95b
Workers number: 1
Analysis
After review, the failure scenario goes like this:
- The tasking code itself runs to completion.
- RQ attempts to notify Redis that the task is completed (in the RQ registry) in its handle_job_success (https://github.com/rq/rq/blob/master/rq/worker.py#L925).
- Interacting with Redis raised an exception at this line: https://github.com/rq/rq/blob/master/rq/worker.py#L932
- This fatal exception is raised and then handled by Pulp's handle_job_failure handler implementation, which records the exception (this is how we know about it) and also marks the task as failed.
- Also, when Redis became unavailable, it forgot the tasks it was storing in memory, including the _release_resources task that pairs with the now failed task and is intended to release the locks.
- The worker never died, so the other lock cleanup processes never ran.
- Tasks back up, and eventually a sysadmin restarts the processes.
- The cleanup code in mark_worker_offline is triggered, but since the task is already at FAILED, this line does not issue its cancellation, which would release the locks.
- The locks are never released.
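The scenario above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: Task, FakeDB, cancel, mark_worker_offline_old, and simulate_failure are hypothetical stand-ins, not pulpcore or RQ code, and the state names and the "/repo/1" resource are made up for the example.

```python
from dataclasses import dataclass, field


# Hypothetical in-memory stand-ins; these are not pulpcore's models.
@dataclass
class Task:
    name: str
    worker: str
    state: str = "running"


@dataclass
class FakeDB:
    tasks: list = field(default_factory=list)
    reservations: dict = field(default_factory=dict)  # resource -> task name


def cancel(db, task):
    """Cancel a task and release the reservations it holds."""
    task.state = "canceled"
    for resource in [r for r, owner in db.reservations.items() if owner == task.name]:
        del db.reservations[resource]


def mark_worker_offline_old(db, worker):
    """Cleanup as described in the analysis: only incomplete tasks are canceled."""
    for task in db.tasks:
        if task.worker == worker and task.state in ("waiting", "running"):
            cancel(db, task)


def simulate_failure():
    db = FakeDB()
    task = Task("curate_synclist_repository", worker="worker-1")
    db.tasks.append(task)
    db.reservations["/repo/1"] = task.name
    release_job_queued = True  # the paired _release_resources job lives in Redis

    try:
        # The task body has finished; reporting success to Redis fails.
        raise ConnectionError("Redis unavailable")
    except ConnectionError:
        task.state = "failed"       # the failure handler records the error
        release_job_queued = False  # Redis lost the jobs it held in memory

    # A sysadmin restarts the processes, which triggers worker cleanup.
    mark_worker_offline_old(db, "worker-1")
    return db, release_job_queued
```

Calling simulate_failure() leaves the "/repo/1" reservation in place: the task is already failed, so the cleanup skips it, and the release job that would have removed the reservation no longer exists.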
Solution
Add code to mark_worker_offline that ensures all locks for a worker being cleaned up are released, even if a task failed and its _release_resources was never delivered. This should occur after the cancellation, for all tasks in "completed" states.
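A minimal sketch of the proposed ordering, assuming an in-memory model rather than pulpcore's actual database code. The names Task, INCOMPLETE_STATES, release_locks, and mark_worker_offline_sketch are hypothetical, as are the state strings and resource paths; the real change is in the changeset referenced in the history below.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    worker: str
    state: str


INCOMPLETE_STATES = ("waiting", "running")  # assumed state names


def release_locks(reservations, task):
    """Drop every reservation held by the given task."""
    for resource in [r for r, owner in reservations.items() if owner == task.name]:
        del reservations[resource]


def mark_worker_offline_sketch(tasks, reservations, worker):
    worker_tasks = [t for t in tasks if t.worker == worker]

    # Step 1: cancel incomplete tasks, which also releases their locks.
    for task in worker_tasks:
        if task.state in INCOMPLETE_STATES:
            task.state = "canceled"
            release_locks(reservations, task)

    # Step 2: release any locks left behind by tasks already in a final
    # state, e.g. a failed task whose release job was never delivered.
    for task in worker_tasks:
        release_locks(reservations, task)


tasks = [
    Task("sync", "worker-1", "failed"),      # lock leaked by the Redis failure
    Task("publish", "worker-1", "running"),  # still incomplete at cleanup time
    Task("other", "worker-2", "running"),    # belongs to a different worker
]
reservations = {"/repo/1": "sync", "/repo/2": "publish", "/repo/3": "other"}
mark_worker_offline_sketch(tasks, reservations, "worker-1")
print(reservations)  # {'/repo/3': 'other'}
```

Running the second pass after the cancellations keeps the existing cancellation path unchanged and only adds a catch-all for locks that the normal release path missed. Locks held by other workers are untouched.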
Related issues
Updated by fao89 about 4 years ago
- Related to Issue #7386: Task that does not exist in worker or resource-manager are never cleaned up added
Updated by fao89 about 4 years ago
- Priority changed from Normal to High
- Triaged changed from No to Yes
Updated by bmbouter about 4 years ago
- Subject changed from Failed curate_synclist_repository task did not clean up properly resource reservations to Failed task did not clean up properly resource reservations
- Description updated (diff)
Updated by bmbouter about 4 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to bmbouter
Updated by pulpbot about 4 years ago
- Status changed from ASSIGNED to POST
Added by bmbouter about 4 years ago
Updated by bmbouter about 4 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulpcore|516df3147b56660fbc9b22c215e309da1ff8080e.
Updated by pulpbot about 4 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Updated by daviddavis about 4 years ago
- Related to Backport #7951: Backport 7907 added
Updated by ttereshc about 3 years ago
- Copied to Backport #9547: Backport "Failed task did not clean up properly resource reservations" to 3.7.z added
Adds additional lock cleanup to worker cleanup
As another layer of security to guard against lock cleanup not occurring due to Redis not delivering the _release_resource task, ensure all locks are also cleaned up even for tasks that are in their final states.
closes #7907