Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl - Pulp

Actions

Send by e-mail Copy link

Issue #2861

closed

Queued tasks will be lost if the resource manager is restarted via systemctl

Added by dalley over 7 years ago. Updated almost 6 years ago.

Status:

CLOSED - WONTFIX

Priority:

High

Assignee:

daviddavis

Category:

Sprint/Milestone:

Start date:

Due date:

Estimated time:

Severity:

2. Medium

Version:

Platform Release:

OS:

Triaged:

Yes

Groomed:

Sprint Candidate:

Tags:

Pulp 2

Sprint:

Sprint 23

Quarter:

Description

The resource_manager queue loses a currently running _queue_reserved_task if the resource manager is restarted with sudo systemctl restart pulp_resource_manager.

The task is lost from the queue but still has an incorrect TaskStatus record showing as waiting which will never run.

Note that if you sudo pkill -9 -f resource_manager and the sudo systemctl start pulp_resource_manager it does not lose the task.

sudo systemctl stop pulp_workers
pulp-admin rpm repo sync run --repo-id zoo
qpid-stat -q                        <<-- observe that the queue depth of the resource_manager queue is 1
sudo systemctl restart pulp_resource_manager
qpid-stat -q                        <<-- observe that the queue depth of the resource_manager queue is 0
pulp-admin tasks list -s waiting    <<-- observe that the task which is gone is listed as 'waiting', but it will never run because it is gone

Related issues

Actions

Copy link

Updated by dalley over 7 years ago

Description updated (diff)

Actions

Copy link

Updated by bmbouter over 7 years ago

I believe the root cause of this is in Celery code. The next step is to prove that by making a simple rabbitMQ + Celery 4.0.z reproducer. The reproducer can use a task like:

from celery import Task  ## not 100% sure on this line, but you get the idea

@task(base=Task, acks_late=True)
def dummy_task(name, inner_task_id, resource_id, inner_args, inner_kwargs, options):
    import time
    time.sleep(600)  # sleep for five minutes

The key is to have a worker handling a task with acks_late=True when it is restarted. Also we want to even remove systemd from the reproducer. Ultimately systemd just sends signals to the process so you can file a better reproducer if you tell celery to restart using the signal command.

Actions

Copy link

Updated by bmbouter over 7 years ago

Oh also this should be added to the sprint. It's a reliability issue that will affect both Pulp2 and Pulp3.

Actions

Copy link

Updated by Ichimonji10 over 7 years ago

Description updated (diff)

Actions

Copy link

Updated by ttereshc over 7 years ago

Priority changed from Normal to High
Sprint/Milestone set to 40
Triaged changed from No to Yes

Actions

Copy link

Updated by mhrivnak over 7 years ago

Sprint/Milestone changed from 40 to 41

Actions

Copy link

Updated by daviddavis over 7 years ago

Status changed from NEW to ASSIGNED
Assignee set to daviddavis

Actions

Copy link

Updated by daviddavis over 7 years ago

FWIW I'm seeing the queue empty as well when I just run systemctl stop:

[vagrant@pulp2 ~]$ qpid-stat -q
Queues
  queue                                     dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================
  8d7f6fbb-a788-4025-9370-264eef9214aa:1.0       Y        Y        0     4      4       0   2.42k    2.42k        1     2
  9b04afc7-3f8a-46b4-8eae-0d1b7d5a0947:1.0       Y        Y        0     2      2       0    486      486         1     2
  c959209f-4f80-4690-9a8d-45a326f6ffa6:0.0       Y        Y        0     0      0       0      0        0         1     2
  celery                                    Y                      2    98     96    1.67k  81.4k    79.7k        0     2
  f2768b43-411b-40b9-9277-ec1770c1555c:1.0       Y        Y        0     4      4       0   2.46k    2.46k        1     2
  fdbde372-02dd-4a27-b8d1-f394903cfd9f:1.0       Y        Y        0     8      8       0   4.88k    4.88k        1     2
  pulp.task                                 Y                      0     0      0       0      0        0         3     1
  resource_manager                          Y                      1     4      3    1.19k  4.76k    3.58k        0     2
[vagrant@pulp2 ~]$ sudo systemctl stop pulp_resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
  queue                                     dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================
  8d7f6fbb-a788-4025-9370-264eef9214aa:1.0       Y        Y        0     4      4       0   2.42k    2.42k        1     2
  97b59cc7-1398-48cf-99eb-6f2a14810dec:0.0       Y        Y        0     0      0       0      0        0         1     2
  celery                                    Y                      2    98     96    1.67k  81.4k    79.7k        0     2
  f2768b43-411b-40b9-9277-ec1770c1555c:1.0       Y        Y        0     4      4       0   2.46k    2.46k        1     2
  pulp.task                                 Y                      0     0      0       0      0        0         3     1
  resource_manager                          Y                      0     4      4       0   4.76k    4.76k        0     2

Actions

Copy link

#10

Updated by daviddavis over 7 years ago

Looks like using "sudo pkill -f resource_manager" causes the queue to become empty:

[vagrant@pulp2 ~]$ qpid-stat -q
Queues
  queue                                     dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================
  4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0       Y        Y        0     4      4       0   2.46k    2.46k        1     2
  54271755-b7f0-4587-b801-2864083ba242:0.0       Y        Y        0     0      0       0      0        0         1     2
  887d958d-49d5-404f-8820-37d8abae3245:1.0       Y        Y        0     2      2       0    486      486         1     2
  a560e0d7-e702-4967-9ebf-9ea20922688d:1.0       Y        Y        0     8      8       0   4.88k    4.88k        1     2
  a560e0d7-e702-4967-9ebf-9ea20922688d:2.0       Y        Y        0     4      4       0   2.47k    2.47k        1     2
  celery                                    Y                      0     0      0       0      0        0         0     2
  pulp.task                                 Y                      0     0      0       0      0        0         3     1
  resource_manager                          Y                      1     2      1    1.19k  2.37k    1.19k        1     2
  resource_manager@pulp2.dev.celery.pidbox       Y                 0     0      0       0      0        0         1     2
  resource_manager@pulp2.dev.dq             Y    Y                 0     0      0       0      0        0         1     2
[vagrant@pulp2 ~]$ sudo pkill -f resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q                                                                                                                                                                                  
Queues
  queue                                     dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================
  4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0       Y        Y        0     4      4       0   2.46k    2.46k        1     2
  c45eaa67-b86c-4267-93b9-f23a67b0c3c5:0.0       Y        Y        0     0      0       0      0        0         1     2
  celery                                    Y                      0     0      0       0      0        0         0     2
  pulp.task                                 Y                      0     0      0       0      0        0         3     1
  resource_manager                          Y                      0     2      2       0   2.37k    2.37k        0     2

This corresponds to what we're seeing with systemctl stop which uses SIGTERM, waits for a period of time, and then sends SIGKILL. Looks like the issue is with the SIGTERM.

Actions

Copy link

#11

Updated by daviddavis over 7 years ago

After doing some Googling, I found this particular bug:

https://github.com/celery/celery/issues/3057

In debugging, I found that if I comment out this section, restarting pulp_resource_manager doesn't wipe the queue:

https://github.com/celery/celery/blob/51a494019e863188b39f86aec79e23305ba97311/celery/worker/job.py#L440-L442

This seems to line up with what @ask says. When resource_manager is killed via warm shutdown, it acknowledges any queued tasks. This doesn't happen during a cold shutdown (kill -9).

Not sure how to solve this.

Actions

Copy link

#12

Updated by daviddavis over 7 years ago

Looks like they've solved this problem in Celery 4 by adding an option to reject and then requeue the task when the worker is killed:

http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-reject-on-worker-lost

Here's the code:

https://github.com/celery/celery/blob/199cf69f98f3aa655fd9ccd59a09d22de2716b2d/celery/worker/request.py#L368-L379

Actions

Copy link

#13

Updated by daviddavis over 7 years ago

So the decision was made to fix this bug for Celery 4. However, I am unable to reproduce this bug on Celery 4.0.2. I suspect that celery added a different codepath to properly handle a warm shutdown.

This may or may not be related to this bug:

https://github.com/celery/celery/issues/3802

I am tempted to close this bug out since it only affects Celery 3 and it only drops the first message in the queue (the one being processed) .

Actions

Copy link

#14

Updated by mhrivnak over 7 years ago

Sprint/Milestone changed from 41 to 42

Actions

Copy link

#15

Updated by daviddavis over 7 years ago

Next step is that we need to confirm that the request is getting acknowledge on warm shutdown for Celery 4. It should be. Test against qpid and rabbitmq.

Also, we should be setting task_reject_on_worker_lost to true for celery 4 in pulp but only for the resource manager.

Lastly, we need to fix this for Pulp 2 AND Pulp 3.

Actions

Copy link

#16

Updated by ttereshc over 7 years ago

Related to Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run added

Actions

Copy link

#17

Updated by daviddavis over 7 years ago

Digging into celery 4 with rabbitmq, it looks like it's hitting on_failure in request with a WorkerLostError when I restart pulp_resource_manager:

> /usr/lib/python2.7/site-packages/celery/worker/request.py(377)on_failure()
-> if reject:
(Pdb) exc
WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
(Pdb) type(exc)
<class 'billiard.exceptions.WorkerLostError'>

From debugging, it's calling acknowledge()[1]. However, I'm still stumped as to why the queue isn't being cleared like it is in celery 3.

Also, one other observation: if I start up pulp_workers afterwards, the sync task gets processed.

[1] https://github.com/celery/celery/blob/87b263bcea88756d870d19f27af9cb54c6f860cf/celery/worker/request.py#L379

Actions

Copy link

#18