Issue #7907: Failed task did not clean up properly resource reservations - Pulp

Added by osapryki about 4 years ago. Updated about 4 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee: bmbouter
Sprint/Milestone: 3.9.0
Severity: 2. Medium
Triaged: Yes
Sprint: Sprint 86

Description

In Automation Hub, the task galaxy_ng.app.tasks.synclist.curate_synclist_repository failed because of a Redis failure. However, the resource reservation for that task remained in the database, blocking the entire tasking system (when the number of workers is 1).

https://gist.github.com/cutwater/4ec7960f0eac2793ca17a78723dca75d

Environment:

pulpcore 3.7.1

pulp-ansible 0.4.3

galaxy-ng 1326eb5f1679880b68e05a48d4377def7c72a95b

Workers number: 1

Analysis

After review, the failure scenario goes like this:

1. The tasking code itself runs to completion.
2. RQ attempts to notify Redis that the task is completed (in the RQ registry) in its handle_job_success: https://github.com/rq/rq/blob/master/rq/worker.py#L925
3. Interacting with Redis raised an exception at this line: https://github.com/rq/rq/blob/master/rq/worker.py#L932
4. This fatal exception was raised and then handled by Pulp's handle_job_failure handler implementation, which records the exception (this is how we know about it) and also marks the task as failed.
5. Also, when Redis became unavailable, it lost the tasks it was storing in memory, including the _release_resources task that pairs with the now-failed task and is intended to release the locks.
6. The worker never died, so the other lock cleanup processes never ran.
7. Tasks back up, and eventually a sysadmin restarts the processes.
8. The cleanup code in mark_worker_offline is triggered, but since the task is already at FAILED, the cleanup does not issue a cancellation for it, and that cancellation is what would have released the locks.
9. The locks are never released.
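
The sequence above can be reproduced in a toy model. This sketch is not pulpcore code: the ToyTasking class, its fields, the state names, and the resource path are hypothetical stand-ins. It assumes only what the analysis describes, namely that worker cleanup cancels tasks that are not yet in a final state and that cancellation is what releases reservations.

```python
from dataclasses import dataclass, field

# Hypothetical state names, chosen for illustration only.
FINAL_STATES = {"completed", "failed", "canceled"}


@dataclass
class Task:
    name: str
    state: str
    worker: str


@dataclass
class ToyTasking:
    tasks: list = field(default_factory=list)
    # Maps a reserved resource to the name of the task holding it.
    reservations: dict = field(default_factory=dict)

    def cancel(self, task):
        """Cancel a task and drop every reservation it holds."""
        task.state = "canceled"
        self.reservations = {
            res: owner for res, owner in self.reservations.items() if owner != task.name
        }

    def mark_worker_offline(self, worker):
        """Cleanup as described in the analysis: only non-final tasks are canceled."""
        for task in self.tasks:
            if task.worker == worker and task.state not in FINAL_STATES:
                self.cancel(task)


system = ToyTasking()
task = Task("curate_synclist_repository", "failed", "worker-1")
system.tasks.append(task)
# The paired release never arrived, so the reservation is still recorded.
system.reservations["/repo/1"] = task.name

system.mark_worker_offline("worker-1")
# The failed task was skipped, so its reservation is still present.
print(system.reservations)
```

With a single worker, any later task that needs the same resource would wait on this leftover reservation indefinitely.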

Solution

Add code to mark_worker_offline that ensures all locks held by a worker being cleaned up are released, even if a task failed and its _release_resources was never delivered. This release should run after the existing cancellation step, so that it also covers tasks that are already in a final state such as FAILED.
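
A minimal sketch of that idea, using plain dictionaries rather than pulpcore's models. The function name cleanup_worker, the data layout, and the state names are assumptions made for illustration, not the code merged for this issue. The sketch also assumes that each reservation records which worker holds it.

```python
# Hypothetical state names, chosen for illustration only.
FINAL_STATES = {"completed", "failed", "canceled"}


def cleanup_worker(worker, tasks, reservations):
    """Clean up after a worker that is going offline.

    tasks maps a task name to {"state": ..., "worker": ...}.
    reservations maps a resource to the name of the worker holding it.
    Returns the sorted list of resources that were released.
    """
    # Step 1: cancel the worker's tasks that have not reached a final state.
    for task in tasks.values():
        if task["worker"] == worker and task["state"] not in FINAL_STATES:
            task["state"] = "canceled"

    # Step 2: release every reservation the worker still holds, regardless
    # of the state of the task that created it.
    released = [res for res, owner in reservations.items() if owner == worker]
    for res in released:
        del reservations[res]
    return sorted(released)


tasks = {"curate_synclist_repository": {"state": "failed", "worker": "worker-1"}}
reservations = {"/repo/1": "worker-1", "/repo/2": "worker-2"}

released = cleanup_worker("worker-1", tasks, reservations)
print(released, reservations)
```

Because step 2 keys on the worker rather than on task state, a task that is already FAILED no longer leaves its locks behind, and reservations held by other workers are untouched.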

Updated by fao89 about 4 years ago

Related to Issue #7386: Task that does not exist in worker or resource-manager are never cleaned up added

Updated by fao89 about 4 years ago

Priority changed from Normal to High
Triaged changed from No to Yes

Updated by bmbouter about 4 years ago

Subject changed from Failed curate_synclist_repository task did not clean up properly resource reservations to Failed task did not clean up properly resource reservations
Description updated (diff)

Updated by bmbouter about 4 years ago

Status changed from NEW to ASSIGNED
Assignee set to bmbouter

Updated by dkliban@redhat.com about 4 years ago

Sprint set to Sprint 86

Updated by pulpbot about 4 years ago

Status changed from ASSIGNED to POST

PR: https://github.com/pulp/pulpcore/pull/1046

Added by bmbouter about 4 years ago

Revision 516df314

Adds additional lock cleanup to worker cleanup

As another layer of security to guard against lock cleanup not occurring due to Redis not delivering the _release_resource task, ensure all locks are also cleaned up even for tasks that are in their final states.

closes #7907
