Issue #3817
Closed: Task assigned to the resource manager is not cancelled when it stops
Description
When the resource manager worker is stopped, tasks assigned to the resource manager are not cancelled and remain in the "Waiting" state forever. This is because the "worker_name" field of the TaskStatus is not set at creation, so Pulp cannot find the resource manager's tasks during worker cleanup.
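To make the lookup problem concrete, here is a minimal sketch of a cleanup that selects tasks by worker name. It is not Pulp's actual cleanup code: the function `cancel_tasks_for_worker`, the dict layout, and the worker names are all hypothetical stand-ins for the TaskStatus records.

```python
# Hypothetical, simplified stand-in for TaskStatus records and worker cleanup.
# None of these names are taken from the Pulp code base.

def cancel_tasks_for_worker(task_statuses, worker_name):
    """Cancel every unfinished task assigned to worker_name; return their ids."""
    canceled = []
    for task in task_statuses:
        unfinished = task["state"] in ("waiting", "running")
        if task["worker_name"] == worker_name and unfinished:
            task["state"] = "canceled"
            canceled.append(task["task_id"])
    return canceled


tasks = [
    # Task sitting at the resource manager: worker_name was never set.
    {"task_id": "a", "state": "waiting", "worker_name": None},
    # Task already handed to a regular worker.
    {"task_id": "b", "state": "running", "worker_name": "worker-0@host"},
]

# Cleanup for the resource manager matches nothing, so task "a" stays waiting.
found = cancel_tasks_for_worker(tasks, "resource_manager@host")
```

Because the match is on `worker_name`, a record whose `worker_name` is empty can never be selected by a cleanup for any worker, which matches the behaviour described above.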
Below is how I reproduced it.
1) Stop all the Pulp workers and leave only the Pulp resource manager running.
2) Trigger a "regenerate applicability for consumers" task.
3) Check the Qpid stats; I saw 1 message.
- qpid-stat -q -b amqps://localhost:5671
Queues
queue             dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
==============================================================================================
resource_manager  Y                   1    1      0       1.36k  1.36k    0         1     2
4) Stop the resource manager with 'systemctl stop pulp_resource_manager'. I saw that the message was dequeued.
Queues
queue             dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
==============================================================================================
resource_manager  Y                   0    1      1       0      1.36k    1.36k     0     2
5) Check the task; I saw that it was still in the "Waiting" state.
pulp_tasks:
- exception:
  task_type: pulp.server.managers.consumer.applicability.regenerate_applicability_for_consumers
  _href: "/pulp/api/v2/tasks/7885eaad-7563-4814-981a-b6e23cf459c4/"
  task_id: 7885eaad-7563-4814-981a-b6e23cf459c4
  tags:
  - pulp:action:consumer_content_applicability_regeneration
  finish_time:
  _ns: task_status
  start_time:
  traceback:
  spawned_tasks: []
  progress_report: {}
  queue: None.dq
  state: waiting
  worker_name:
Updated by hyu almost 6 years ago
I think this commit should fix the issue.
https://github.com/pulp/pulp/pull/3540/commits/3f84c6876bb7d78c36aefda7b0d510614f84395e
Updated by bmbouter almost 6 years ago
- Status changed from NEW to CLOSED - WORKSFORME
- Assignee set to bmbouter
I was able to reproduce this on an EL7 box with Pulp server 2.13.4 and python-celery-3.1.17-1.el7sat.noarch. When the stop is issued, the log specifically says:
celery.worker.job:ERROR: (10040-25760) Task pulp.server.async.tasks._queue_reserved_task[15a93e5d-626e-47bb-8917-cbeb3fad1ef6] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
celery.worker.job:ERROR: (10040-25760) Traceback (most recent call last):
celery.worker.job:ERROR: (10040-25760) File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
celery.worker.job:ERROR: (10040-25760) human_status(exitcode)),
celery.worker.job:ERROR: (10040-25760) WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
I could not reproduce this on my Fedora 27 installation, which has newer Pulp code and a newer Celery version. The Pulp code since 2.13 makes no changes in this area, so I believe the newer Celery is what resolves it in my dev environment. There, when issuing sudo systemctl stop pulp_resource_manager, it waits a long time for a graceful shutdown, but then the logs say systemctl killed it with SIGKILL. Listing qpid-stat -q there shows the task is still queued. When the workers start up, the task resumes as expected and transitions out of the WAITING state.
So we can't accept this patch for 2 reasons. 1) We want the task status to stay as waiting so that it can resume. 2) The 'worker' field is for the actual worker that will process the task. The resource manager doesn't actually do that processing; it's just a coordinator. We don't want tasks to be cancelled at the resource manager, which is why we use the Celery acks_late feature.
I think the issue is that acks_late is broken in the older version of Celery, so I believe the resolution is to upgrade Celery to Celery 4.
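To illustrate why late acknowledgement matters here, below is a toy in-memory model of early versus late acks. It is only a sketch: the `Broker` class and `consume` function are hypothetical and are not the Celery or Qpid APIs. The assumption is that a broker returns unacknowledged messages to the queue when the consumer goes away.

```python
from collections import deque


class Broker:
    """Toy queue: a delivered message is gone for good only once it is acked."""

    def __init__(self, messages):
        self.queue = deque(messages)
        self.unacked = []

    def deliver(self):
        msg = self.queue.popleft()
        self.unacked.append(msg)
        return msg

    def ack(self, msg):
        self.unacked.remove(msg)

    def consumer_lost(self):
        # Assumed behaviour: unacked messages go back on the queue.
        while self.unacked:
            self.queue.appendleft(self.unacked.pop())


def consume(broker, acks_late, dies_mid_task):
    msg = broker.deliver()
    if not acks_late:
        broker.ack(msg)          # early ack: acknowledged before the work runs
    if dies_mid_task:
        broker.consumer_lost()   # e.g. the process is stopped mid-task
        return
    # ... the real work would happen here ...
    if acks_late:
        broker.ack(msg)          # late ack: acknowledged only after the work


early = Broker(["task-1"])
consume(early, acks_late=False, dies_mid_task=True)

late = Broker(["task-1"])
consume(late, acks_late=True, dies_mid_task=True)
```

In this model the early-ack queue ends up empty, so the message is lost, while the late-ack queue still holds "task-1". That corresponds to the newer-Celery behaviour described above, where the task stays queued and resumes once workers are back.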
I believe the best resolution is to upgrade your EL7 Celery stack. Pulp will recognize the newer dependency and use it. Pulp has been compatible with both Celery 3 and Celery 4 as of 2.12.2 with this work: https://pulp.plan.io/issues/2527
I don't see a code change in Pulp as the right resolution to this ticket so I'm closing as WORKSFORME. I believe the issue is in Celery itself and we're in a no-fix mode for Celery 3 issues since Celery 4 is the current active stream.
Please leave more comments on this if you think we can do anything better here.