Issue #2954

Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4

Added by daviddavis over 4 years ago. Updated almost 3 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release: 2.14.1
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint: Sprint 23
Quarter:

Description

In Celery 3, the resource_manager queue loses a currently running _queue_reserved_task if the resource manager is restarted with sudo systemctl restart pulp_resource_manager.

The task is lost from the queue but still has an incorrect TaskStatus record showing it as waiting, even though it will never run.

Note that if you sudo pkill -9 -f resource_manager and then sudo systemctl start pulp_resource_manager, the task is not lost.

sudo systemctl stop pulp_workers
pulp-admin rpm repo sync run --repo-id zoo
qpid-stat -q                        <<-- observe that the queue depth of the resource_manager queue is 1
sudo systemctl restart pulp_resource_manager
qpid-stat -q                        <<-- observe that the queue depth of the resource_manager queue is 0
pulp-admin tasks list -s waiting    <<-- observe that the task which is gone is listed as 'waiting', but it will never run because it is gone

We need to make sure that this doesn't happen in Celery 4. There's a config option that should prevent this:

http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-reject-on-worker-lost

Also, we need to apply this fix to both Pulp 2 AND Pulp 3.
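For reference, a minimal sketch of what enabling that setting looks like in a Celery 4 application (the app and module names here are hypothetical, not Pulp's actual configuration module; the real Pulp change is in revision af4f688a below):

# sketch.py - hypothetical Celery 4 app enabling the setting from the linked docs
from celery import Celery

app = Celery('tasks')  # default broker settings; adjust as needed

# Acknowledge messages only after the task finishes.
app.conf.task_acks_late = True

# Without this, a late-acking task whose worker process dies abruptly is
# acknowledged anyway and the queued message is lost; with it, the message is
# re-queued so the restarted (or another) worker can run it. Celery 4+ only.
app.conf.task_reject_on_worker_lost = True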


Related issues

Related to Pulp - Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl (CLOSED - WONTFIX)
Related to Pulp - Issue #2958: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost (CLOSED - CURRENTRELEASE)

Associated revisions

Revision af4f688a
Added by daviddavis over 4 years ago

Turn on task_reject_on_worker_lost to prevent lost tasks

Turn on task_reject_on_worker_lost (aka CELERY_REJECT_ON_WORKER_LOST) to prevent the loss of tasks when a worker dies. This option is only available in Celery 4+.

fixes #2954 https://pulp.plan.io/issues/2954

History

#1 Updated by daviddavis over 4 years ago

  • Subject changed from Queued tasks will be lost if the resource manager is restarted via systemctl for Celery 4 to Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4

#2 Updated by daviddavis over 4 years ago

It looks like tasks are currently being requeued in Celery 4 even if task_reject_on_worker_lost is not set. They shouldn't be, so we need to investigate and potentially open an issue against Celery.

#3 Updated by daviddavis over 4 years ago

  • Related to Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run added

#4 Updated by daviddavis over 4 years ago

  • Related to deleted (Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run)

#5 Updated by daviddavis over 4 years ago

  • Related to Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl added

#6 Updated by daviddavis over 4 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis

#7 Updated by ttereshc over 4 years ago

  • Sprint/Milestone set to 42
  • Triaged changed from No to Yes

#8 Updated by daviddavis over 4 years ago

So I was able to reproduce the behavior in standalone celery where messages are persisted on warm shutdown even if task_reject_on_worker_lost is not set. It turns out that if you run pkill -f celery instead of kill $CHILD_PROCESS_ID, the message gets persisted.

This is why, when shutting down pulp_resource_manager via systemctl, we're seeing messages getting persisted: systemctl kills (or does a warm shutdown of) both processes. I have no idea why this is. I can open an upstream celery issue, but this behavior sounds pretty much the same as some existing bugs [1][2].

That said, the message persisting is not a problem for us. We're concerned about messages being lost, and if a message gets persisted, it simply runs the next time pulp starts up. We're not concerned about double execution here either, since pulp_workers are not running.

[1] https://github.com/celery/celery/issues/3802
[2] https://github.com/celery/celery/issues/3796
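For reference, the standalone reproduction described in comment #8 can be approximated with a minimal Celery app like the following (module name, task name, and sleep duration are made up for illustration; run the worker with celery -A tasks worker and enqueue the task with slow.delay()):

# tasks.py - hypothetical standalone app for reproducing the behavior above
import time

from celery import Celery

app = Celery('tasks')           # default broker settings; adjust as needed
app.conf.task_acks_late = True  # leave task_reject_on_worker_lost at its default

@app.task
def slow():
    # Long enough to kill processes while the task is still running:
    #   kill <child/pool pid>  -> message is acked and lost despite acks_late
    #   pkill -f celery        -> warm shutdown of parent and child; the message
    #                             is returned to the queue and runs on restart
    time.sleep(120)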

#9 Updated by daviddavis over 4 years ago

So basically we just need to enable task_reject_on_worker_lost and we're good to go.

#10 Updated by daviddavis over 4 years ago

  • Related to Issue #2958: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost added

#11 Updated by daviddavis over 4 years ago

Opened separate issue for Pulp 3:

https://pulp.plan.io/issues/2958

#12 Updated by daviddavis over 4 years ago

  • Status changed from ASSIGNED to POST

#13 Updated by daviddavis over 4 years ago

I would probably recommend using the following workflow for testing, as it's a bit more precise in that it only kills the child worker process. Using sudo systemctl restart pulp_resource_manager will kill both the child and the parent, which could leave the message in the queue and thus produce a false positive.

sudo systemctl stop pulp_workers # may need to wait 30 seconds for this to die
pulp-admin rpm repo sync run --repo-id zoo --bg
qpid-stat -q # observe that the queue depth of the resource_manager queue is 1
ps auxf | grep resource_manager # grab the child process id (e.g. 12345) 
sudo kill 12345
qpid-stat -q # observe that the queue depth of the resource_manager queue is still 1
sudo systemctl restart pulp_resource_manager
sudo systemctl start pulp_workers # may need to wait 30 seconds for this to start and pick up task
pulp-admin tasks list -s waiting # should be empty

#15 Updated by daviddavis over 4 years ago

  • Status changed from POST to MODIFIED

#16 Updated by pcreech over 4 years ago

  • Platform Release set to 2.14.1

#18 Updated by pcreech over 4 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

#19 Updated by bmbouter almost 4 years ago

  • Sprint set to Sprint 23

#20 Updated by bmbouter almost 4 years ago

  • Sprint/Milestone deleted (42)

#21 Updated by bmbouter almost 3 years ago

  • Tags Pulp 2 added
