Issue #2861
closedQueued tasks will be lost if the resource manager is restarted via systemctl
Description
The resource_manager queue loses a currently running _queue_reserved_task if the resource manager is restarted with sudo systemctl restart pulp_resource_manager
.
The task is lost from the queue but still has an incorrect TaskStatus record showing as waiting which will never run.
Note that if you sudo pkill -9 -f resource_manager
and the sudo systemctl start pulp_resource_manager
it does not lose the task.
sudo systemctl stop pulp_workers
pulp-admin rpm repo sync run --repo-id zoo
qpid-stat -q <<-- observe that the queue depth of the resource_manager queue is 1
sudo systemctl restart pulp_resource_manager
qpid-stat -q <<-- observe that the queue depth of the resource_manager queue is 0
pulp-admin tasks list -s waiting <<-- observe that the task which is gone is listed as 'waiting', but it will never run because it is gone
Related issues
Updated by bmbouter over 7 years ago
I believe the root cause of this is in Celery code. The next step is to prove that by making a simple rabbitMQ + Celery 4.0.z reproducer. The reproducer can use a task like:
from celery import Task ## not 100% sure on this line, but you get the idea
@task(base=Task, acks_late=True)
def dummy_task(name, inner_task_id, resource_id, inner_args, inner_kwargs, options):
import time
time.sleep(600) # sleep for five minutes
The key is to have a worker handling a task with acks_late=True
when it is restarted. Also we want to even remove systemd from the reproducer. Ultimately systemd just sends signals to the process so you can file a better reproducer if you tell celery to restart using the signal
command.
Updated by bmbouter over 7 years ago
Oh also this should be added to the sprint. It's a reliability issue that will affect both Pulp2 and Pulp3.
Updated by ttereshc over 7 years ago
- Priority changed from Normal to High
- Sprint/Milestone set to 40
- Triaged changed from No to Yes
Updated by daviddavis over 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
Updated by daviddavis over 7 years ago
FWIW I'm seeing the queue empty as well when I just run systemctl stop:
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
8d7f6fbb-a788-4025-9370-264eef9214aa:1.0 Y Y 0 4 4 0 2.42k 2.42k 1 2
9b04afc7-3f8a-46b4-8eae-0d1b7d5a0947:1.0 Y Y 0 2 2 0 486 486 1 2
c959209f-4f80-4690-9a8d-45a326f6ffa6:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 2 98 96 1.67k 81.4k 79.7k 0 2
f2768b43-411b-40b9-9277-ec1770c1555c:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
fdbde372-02dd-4a27-b8d1-f394903cfd9f:1.0 Y Y 0 8 8 0 4.88k 4.88k 1 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 1 4 3 1.19k 4.76k 3.58k 0 2
[vagrant@pulp2 ~]$ sudo systemctl stop pulp_resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
8d7f6fbb-a788-4025-9370-264eef9214aa:1.0 Y Y 0 4 4 0 2.42k 2.42k 1 2
97b59cc7-1398-48cf-99eb-6f2a14810dec:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 2 98 96 1.67k 81.4k 79.7k 0 2
f2768b43-411b-40b9-9277-ec1770c1555c:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 0 4 4 0 4.76k 4.76k 0 2
Updated by daviddavis over 7 years ago
Looks like using "sudo pkill -f resource_manager" causes the queue to become empty:
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
54271755-b7f0-4587-b801-2864083ba242:0.0 Y Y 0 0 0 0 0 0 1 2
887d958d-49d5-404f-8820-37d8abae3245:1.0 Y Y 0 2 2 0 486 486 1 2
a560e0d7-e702-4967-9ebf-9ea20922688d:1.0 Y Y 0 8 8 0 4.88k 4.88k 1 2
a560e0d7-e702-4967-9ebf-9ea20922688d:2.0 Y Y 0 4 4 0 2.47k 2.47k 1 2
celery Y 0 0 0 0 0 0 0 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 1 2 1 1.19k 2.37k 1.19k 1 2
resource_manager@pulp2.dev.celery.pidbox Y 0 0 0 0 0 0 1 2
resource_manager@pulp2.dev.dq Y Y 0 0 0 0 0 0 1 2
[vagrant@pulp2 ~]$ sudo pkill -f resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
c45eaa67-b86c-4267-93b9-f23a67b0c3c5:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 0 0 0 0 0 0 0 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 0 2 2 0 2.37k 2.37k 0 2
This corresponds to what we're seeing with systemctl stop which uses SIGTERM, waits for a period of time, and then sends SIGKILL. Looks like the issue is with the SIGTERM.
Updated by daviddavis over 7 years ago
After doing some Googling, I found this particular bug:
https://github.com/celery/celery/issues/3057
In debugging, I found that if I comment out this section, restarting pulp_resource_manager doesn't wipe the queue:
This seems to line up with what @ask says. When resource_manager is killed via warm shutdown, it acknowledges any queued tasks. This doesn't happen during a cold shutdown (kill -9).
Not sure how to solve this.
Updated by daviddavis over 7 years ago
Looks like they've solved this problem in Celery 4 by adding an option to reject and then requeue the task when the worker is killed:
http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-reject-on-worker-lost
Here's the code:
Updated by daviddavis over 7 years ago
So the decision was made to fix this bug for Celery 4. However, I am unable to reproduce this bug on Celery 4.0.2. I suspect that celery added a different codepath to properly handle a warm shutdown.
This may or may not be related to this bug:
https://github.com/celery/celery/issues/3802
I am tempted to close this bug out since it only affects Celery 3 and it only drops the first message in the queue (the one being processed) .
Updated by daviddavis over 7 years ago
Next step is that we need to confirm that the request is getting acknowledge on warm shutdown for Celery 4. It should be. Test against qpid and rabbitmq.
Also, we should be setting task_reject_on_worker_lost to true for celery 4 in pulp but only for the resource manager.
Lastly, we need to fix this for Pulp 2 AND Pulp 3.
Updated by ttereshc over 7 years ago
- Related to Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run added
Updated by daviddavis over 7 years ago
Digging into celery 4 with rabbitmq, it looks like it's hitting on_failure in request with a WorkerLostError when I restart pulp_resource_manager:
> /usr/lib/python2.7/site-packages/celery/worker/request.py(377)on_failure()
-> if reject:
(Pdb) exc
WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
(Pdb) type(exc)
<class 'billiard.exceptions.WorkerLostError'>
From debugging, it's calling acknowledge()[1]. However, I'm still stumped as to why the queue isn't being cleared like it is in celery 3.
Also, one other observation: if I start up pulp_workers afterwards, the sync task gets processed.
Updated by daviddavis over 7 years ago
- Status changed from ASSIGNED to CLOSED - WONTFIX
To fix this for Celery 3, we'd need to carry a patch for Celery and I don't think we want to do that. I am going to close this.
I've opened a separate task for Celery 4:
Updated by daviddavis over 7 years ago
- Related to Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4 added