Issue #7912

The tasking system deadlocks when Redis loses its tasks

Added by bmbouter 11 months ago. Updated 9 months ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags:
Sprint: Sprint 89
Quarter:

Description

It's been observed that Redis can "lose" a task from its queue. This is believed to be the situation in https://pulp.plan.io/issues/7907, for example.

If Redis loses the _release_resource task but the worker stays running, no cleanup mechanism will release the resource locks the worker acquired. In this situation the tasking system could back up indefinitely. With the bugfix for https://pulp.plan.io/issues/7907, restarting the worker will release the locks, but it would be better if a process restart were not required.

Simulating this experience

  1. Create a remote
  2. Call sync 5 times in a row (to cause them to all be dispatched to the same worker)
  3. Verify they are in the worker's queue with rq info
  4. Tell RQ to forget the tasks without telling Pulp by running rq empty <the_worker's_queue_name>. For example: rq empty 47640@pulp3-source-fedora32.localhost.example.com

Improving this situation

Have the resource manager perform a health check at the two sleep(0.25) statements. One here and the other here.

Those are the exact points where the resource manager cannot dispatch the next task because it's waiting for locks to be released. If the failure situation is occurring, it will wait there indefinitely, so those are the correct (and efficient) places to run the check.
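To illustrate where the check would run, here is a minimal sketch of a wait loop that calls a health check before each sleep. The names wait_for_locks, try_dispatch, and health_check are hypothetical and are not taken from the Pulp code base; the only detail from this issue is the 0.25 second sleep interval.

```python
import time
from typing import Callable


def wait_for_locks(
    try_dispatch: Callable[[], bool],
    health_check: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 0.25,
) -> int:
    """Retry try_dispatch until it succeeds.

    Each time dispatch fails (no worker available, or the resource locks
    are still held), run the health check before sleeping, so that lost
    jobs are noticed at exactly the point where the loop would otherwise
    wait forever. Returns the number of failed attempts.
    """
    failed_attempts = 0
    while not try_dispatch():
        # Any exception raised by the health check propagates to the caller.
        health_check()
        sleep(interval)
        failed_attempts += 1
    return failed_attempts
```

Because the check only runs when dispatch is already blocked, it adds no cost to the normal path where locks are acquired on the first attempt.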

Specifically, have the resource manager run a health check that does the following:

for worker in all_the_online_workers:
    for task in all tasks assigned to that worker in incomplete states:
        ask RQ if it has that job ID in the worker's queue still

# also do the same for the resource manager
for task in all tasks that are not yet assigned to a worker:
    ask RQ if the resource manager queue has the job, by its rq_job_id

If all the jobs are in RQ, then just continue to loop around in _queue_reserved_resource. For any jobs that are missing, call cancel() on them. If you can't connect to Redis or you hit a fatal exception, just let it raise.
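The health check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation that was merged: the Task class, the state names, and the known_job_ids callable are all hypothetical stand-ins. With RQ, known_job_ids might be backed by something like Queue(name, connection=...).job_ids, and since a job that a worker has already dequeued no longer appears in the queue, a real lookup would likely need to include jobs that are currently executing as well.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set

# Assumed names for the "incomplete" task states; hypothetical.
INCOMPLETE_STATES = {"waiting", "running"}


@dataclass
class Task:
    """Hypothetical stand-in for a Pulp task record."""

    rq_job_id: str
    queue_name: str  # a worker's queue, or the resource manager's queue
    state: str

    def cancel(self) -> None:
        # A real cancel() would also release the task's resource locks.
        self.state = "canceled"


def find_lost_tasks(
    tasks: Iterable[Task],
    known_job_ids: Callable[[str], Iterable[str]],
) -> List[Task]:
    """Return incomplete tasks whose job ID is no longer known for their queue."""
    cache: Dict[str, Set[str]] = {}  # look up each queue only once
    lost = []
    for task in tasks:
        if task.state not in INCOMPLETE_STATES:
            continue
        if task.queue_name not in cache:
            # Connection errors are deliberately not caught here.
            cache[task.queue_name] = set(known_job_ids(task.queue_name))
        if task.rq_job_id not in cache[task.queue_name]:
            lost.append(task)
    return lost


def health_check(
    tasks: Iterable[Task],
    known_job_ids: Callable[[str], Iterable[str]],
) -> List[Task]:
    """Cancel every lost task and return the ones that were canceled."""
    lost = find_lost_tasks(tasks, known_job_ids)
    for task in lost:
        task.cancel()
    return lost
```

Treating the resource manager's queue as just another queue_name lets the same loop cover both the per-worker check and the check for tasks not yet assigned to a worker.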


Related issues

Related to Pulp - Backport #7950: Backport 7912 (CLOSED - WONTFIX)

Blocks Pulp - Backport #7951: Backport 7907 (CLOSED - WONTFIX)

Associated revisions

Revision ca9a98b2 View on GitHub
Added by dalley 9 months ago

Remove separate task for releasing resources

Move it to an after-task action

re: #7912 https://pulp.plan.io/issues/7912

Revision ac83f33c View on GitHub
Added by dalley 9 months ago

Avoid deadlocks when Redis is shut down and tasks are lost

closes: #7912 https://pulp.plan.io/issues/7912

Revision eaa61e99 View on GitHub
Added by ttereshc 9 months ago

Ensure that a task is created in pulp not earlier than the job in redis

Otherwise it can get cancelled when we check for missing tasks/jobs.

re #7912 https://pulp.plan.io/issues/7912

History

#1 Updated by ttereshc 11 months ago

  • Subject changed from The tasking system deadlocks when Redis looses it's tasks to The tasking system deadlocks when Redis looses its tasks

#2 Updated by fao89 11 months ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 87

#3 Updated by dalley 11 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley

#4 Updated by daviddavis 11 months ago

#5 Updated by fao89 11 months ago

#6 Updated by pulpbot 10 months ago

  • Status changed from ASSIGNED to POST

#8 Updated by rchan 10 months ago

  • Sprint changed from Sprint 87 to Sprint 88

#9 Updated by rchan 9 months ago

  • Sprint changed from Sprint 88 to Sprint 89

#10 Updated by dalley 9 months ago

  • Status changed from POST to MODIFIED

#12 Updated by ttereshc 9 months ago

  • Sprint/Milestone set to 3.10.0

#13 Updated by pulpbot 9 months ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
