Issue #2861

Queued tasks will be lost if the resource manager is restarted via systemctl

Added by dalley over 3 years ago. Updated almost 2 years ago.

Status:
CLOSED - WONTFIX
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 23
Quarter:

Description

The resource_manager queue loses a currently running _queue_reserved_task if the resource manager is restarted with sudo systemctl restart pulp_resource_manager.

The task is lost from the queue, but its TaskStatus record incorrectly still shows it as waiting, even though it will never run.

Note that if you sudo pkill -9 -f resource_manager and then sudo systemctl start pulp_resource_manager, the task is not lost.

sudo systemctl stop pulp_workers
pulp-admin rpm repo sync run --repo-id zoo
qpid-stat -q                        <<-- observe that the queue depth of the resource_manager queue is 1
sudo systemctl restart pulp_resource_manager
qpid-stat -q                        <<-- observe that the queue depth of the resource_manager queue is 0
pulp-admin tasks list -s waiting    <<-- observe that the task which is gone is listed as 'waiting', but it will never run because it is gone

Related issues

Related to Pulp - Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run (CLOSED - CURRENTRELEASE)
Related to Pulp - Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4 (CLOSED - CURRENTRELEASE)

History

#1 Updated by dalley over 3 years ago

  • Description updated (diff)

#2 Updated by bmbouter over 3 years ago

I believe the root cause of this is in Celery code. The next step is to prove that by making a simple RabbitMQ + Celery 4.0.z reproducer. The reproducer can use a task like:

import time

from celery import Task, task  ## not 100% sure on these imports, but you get the idea

@task(base=Task, acks_late=True)
def dummy_task(name, inner_task_id, resource_id, inner_args, inner_kwargs, options):
    time.sleep(600)  # sleep for ten minutes

The key is to have a worker handling a task with acks_late=True when it is restarted. Also, we want to remove systemd from the reproducer entirely. Ultimately systemd just sends signals to the process, so you can make a simpler reproducer by telling Celery to restart using signals directly.
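As a sketch of that signal-based approach (these helper names are illustrative, not part of Pulp or Celery — they just send the same signals systemd would, directly to the worker process):

```python
import os
import signal

# Illustrative helpers, not part of Pulp or Celery: send the same signals
# systemd would, directly to the worker process, so the reproducer needs
# only Celery and a broker.

def warm_shutdown(pid):
    """What `systemctl stop` sends first: SIGTERM, a warm shutdown."""
    os.kill(pid, signal.SIGTERM)

def cold_shutdown(pid):
    """What `pkill -9` sends: SIGKILL, a cold shutdown."""
    os.kill(pid, signal.SIGKILL)
```

With these, the reproducer becomes: start the worker, dispatch the acks_late task, call warm_shutdown() on the worker's PID, and check whether the broker still holds the message.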

#3 Updated by bmbouter over 3 years ago

Oh also this should be added to the sprint. It's a reliability issue that will affect both Pulp2 and Pulp3.

#5 Updated by Ichimonji10 over 3 years ago

  • Description updated (diff)

#6 Updated by ttereshc over 3 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone set to 40
  • Triaged changed from No to Yes

#7 Updated by mhrivnak over 3 years ago

  • Sprint/Milestone changed from 40 to 41

#8 Updated by daviddavis over 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis

#9 Updated by daviddavis over 3 years ago

FWIW I'm seeing the queue empty as well when I just run systemctl stop:

[vagrant@pulp2 ~]$ qpid-stat -q
Queues
  queue                                     dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================
  8d7f6fbb-a788-4025-9370-264eef9214aa:1.0       Y        Y        0     4      4       0   2.42k    2.42k        1     2
  9b04afc7-3f8a-46b4-8eae-0d1b7d5a0947:1.0       Y        Y        0     2      2       0    486      486         1     2
  c959209f-4f80-4690-9a8d-45a326f6ffa6:0.0       Y        Y        0     0      0       0      0        0         1     2
  celery                                    Y                      2    98     96    1.67k  81.4k    79.7k        0     2
  f2768b43-411b-40b9-9277-ec1770c1555c:1.0       Y        Y        0     4      4       0   2.46k    2.46k        1     2
  fdbde372-02dd-4a27-b8d1-f394903cfd9f:1.0       Y        Y        0     8      8       0   4.88k    4.88k        1     2
  pulp.task                                 Y                      0     0      0       0      0        0         3     1
  resource_manager                          Y                      1     4      3    1.19k  4.76k    3.58k        0     2
[vagrant@pulp2 ~]$ sudo systemctl stop pulp_resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
  queue                                     dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================
  8d7f6fbb-a788-4025-9370-264eef9214aa:1.0       Y        Y        0     4      4       0   2.42k    2.42k        1     2
  97b59cc7-1398-48cf-99eb-6f2a14810dec:0.0       Y        Y        0     0      0       0      0        0         1     2
  celery                                    Y                      2    98     96    1.67k  81.4k    79.7k        0     2
  f2768b43-411b-40b9-9277-ec1770c1555c:1.0       Y        Y        0     4      4       0   2.46k    2.46k        1     2
  pulp.task                                 Y                      0     0      0       0      0        0         3     1
  resource_manager                          Y                      0     4      4       0   4.76k    4.76k        0     2

#10 Updated by daviddavis over 3 years ago

Looks like using "sudo pkill -f resource_manager" causes the queue to become empty:

[vagrant@pulp2 ~]$ qpid-stat -q
Queues
  queue                                     dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================
  4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0       Y        Y        0     4      4       0   2.46k    2.46k        1     2
  54271755-b7f0-4587-b801-2864083ba242:0.0       Y        Y        0     0      0       0      0        0         1     2
  887d958d-49d5-404f-8820-37d8abae3245:1.0       Y        Y        0     2      2       0    486      486         1     2
  a560e0d7-e702-4967-9ebf-9ea20922688d:1.0       Y        Y        0     8      8       0   4.88k    4.88k        1     2
  a560e0d7-e702-4967-9ebf-9ea20922688d:2.0       Y        Y        0     4      4       0   2.47k    2.47k        1     2
  celery                                    Y                      0     0      0       0      0        0         0     2
  pulp.task                                 Y                      0     0      0       0      0        0         3     1
  resource_manager                          Y                      1     2      1    1.19k  2.37k    1.19k        1     2
  resource_manager@pulp2.dev.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  resource_manager@pulp2.dev.dq             Y    Y                 0     0      0       0      0        0         1     2
[vagrant@pulp2 ~]$ sudo pkill -f resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q                                                                                                                                                                                  
Queues
  queue                                     dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================
  4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0       Y        Y        0     4      4       0   2.46k    2.46k        1     2
  c45eaa67-b86c-4267-93b9-f23a67b0c3c5:0.0       Y        Y        0     0      0       0      0        0         1     2
  celery                                    Y                      0     0      0       0      0        0         0     2
  pulp.task                                 Y                      0     0      0       0      0        0         3     1
  resource_manager                          Y                      0     2      2       0   2.37k    2.37k        0     2

This corresponds to what we're seeing with systemctl stop, which sends SIGTERM, waits for a period of time, and then sends SIGKILL. It looks like the issue is with the SIGTERM.

#11 Updated by daviddavis over 3 years ago

After doing some Googling, I found this particular bug:

https://github.com/celery/celery/issues/3057

In debugging, I found that if I comment out this section, restarting pulp_resource_manager doesn't wipe the queue:

https://github.com/celery/celery/blob/51a494019e863188b39f86aec79e23305ba97311/celery/worker/job.py#L440-L442

This seems to line up with what @ask says. When resource_manager is killed via warm shutdown, it acknowledges any queued tasks. This doesn't happen during a cold shutdown (kill -9).

Not sure how to solve this.

#12 Updated by daviddavis over 3 years ago

Looks like they've solved this problem in Celery 4 by adding an option to reject and then requeue the task when the worker is killed:

http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-reject-on-worker-lost

Here's the code:

https://github.com/celery/celery/blob/199cf69f98f3aa655fd9ccd59a09d22de2716b2d/celery/worker/request.py#L368-L379
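As a minimal sketch of how that option would be enabled (the app name and broker URL below are placeholders, not Pulp's actual configuration):

```python
from celery import Celery

# App name and broker URL are placeholders for illustration.
app = Celery("reproducer", broker="amqp://localhost")

# acks_late alone is not enough: on a warm shutdown (SIGTERM) the in-flight
# message is still acknowledged and lost. With task_reject_on_worker_lost,
# Celery 4 rejects and requeues the message instead.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True
```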

#13 Updated by daviddavis over 3 years ago

So the decision was made to fix this bug for Celery 4. However, I am unable to reproduce this bug on Celery 4.0.2. I suspect that Celery added a different code path to properly handle a warm shutdown.

This may or may not be related to this bug:

https://github.com/celery/celery/issues/3802

I am tempted to close this bug out since it only affects Celery 3 and it only drops the first message in the queue (the one being processed).

#14 Updated by mhrivnak over 3 years ago

  • Sprint/Milestone changed from 41 to 42

#15 Updated by daviddavis over 3 years ago

Next step is that we need to confirm that the request is getting acknowledged on warm shutdown for Celery 4. It should be. Test against qpid and rabbitmq.

Also, we should be setting task_reject_on_worker_lost to true for celery 4 in pulp but only for the resource manager.

Lastly, we need to fix this for Pulp 2 AND Pulp 3.

#16 Updated by ttereshc over 3 years ago

  • Related to Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run added

#17 Updated by daviddavis over 3 years ago

Digging into celery 4 with rabbitmq, it looks like it's hitting on_failure in request with a WorkerLostError when I restart pulp_resource_manager:

> /usr/lib/python2.7/site-packages/celery/worker/request.py(377)on_failure()
-> if reject:
(Pdb) exc
WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
(Pdb) type(exc)
<class 'billiard.exceptions.WorkerLostError'>

From debugging, it's calling acknowledge() [1]. However, I'm still stumped as to why the queue isn't being cleared like it is in Celery 3.

Also, one other observation: if I start up pulp_workers afterwards, the sync task gets processed.

[1] https://github.com/celery/celery/blob/87b263bcea88756d870d19f27af9cb54c6f860cf/celery/worker/request.py#L379
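For reference, a simplified paraphrase (not Celery's actual code) of the decision on_failure() makes: the message is requeued only when the worker process was lost and the task opted into both acks_late and reject_on_worker_lost; otherwise it is acknowledged and lost.

```python
# Simplified paraphrase (not Celery's actual code) of the on_failure()
# decision. `ack` and `reject` stand in for the broker message's
# acknowledge/reject callables.
def on_failure(worker_lost, acks_late, reject_on_worker_lost, ack, reject):
    if worker_lost and acks_late and reject_on_worker_lost:
        reject(requeue=True)   # message goes back on the queue
        return "rejected"
    ack()                      # message is acknowledged and lost
    return "acknowledged"
```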

#18 Updated by daviddavis over 3 years ago

  • Status changed from ASSIGNED to CLOSED - WONTFIX

To fix this for Celery 3, we'd need to carry a patch for Celery and I don't think we want to do that. I am going to close this.

I've opened a separate task for Celery 4:

https://pulp.plan.io/issues/2954

#19 Updated by daviddavis over 3 years ago

  • Related to Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4 added

#20 Updated by bmbouter almost 3 years ago

  • Sprint set to Sprint 23

#21 Updated by bmbouter almost 3 years ago

  • Sprint/Milestone deleted (42)

#22 Updated by bmbouter almost 2 years ago

  • Tags Pulp 2 added
