Issue #2664
Workers Canceling Tasks on Startup Fail if the Broker is Down (closed)
Description
To reproduce:
1. Create a 'zoo' repo that syncs from the zoo fixture repo:
   pulp-admin rpm repo create --repo-id zoo --feed https://repos.fedorapeople.org/repos/pulp/pulp/fixtures/rpm/
2. Sync 'zoo' manually once to see it working. Sync it again, and just as the task begins, run:
   sudo pkill -9 -f reserved_resource_worker
3. Stop qpidd: sudo systemctl stop qpidd
4. Restart the workers: sudo systemctl restart pulp_workers
5. Observe the traceback:
Mar 24 17:25:50 dev pulp[10046]: pulp.server.async.tasks:INFO: Cleaning up shutdown worker 'reserved_resource_worker-0@dev'.
Mar 24 17:25:50 dev celery[10046]: Traceback (most recent call last):
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 274, in _run_finalizers
Mar 24 17:25:50 dev celery[10046]: finalizer()
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 207, in __call__
Mar 24 17:25:50 dev celery[10046]: res = self._callback(*self._args, **self._kwargs)
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 201, in _send_worker_shutdown
Mar 24 17:25:50 dev celery[10046]: signals.worker_shutdown.send(sender=self)
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 166, in send
Mar 24 17:25:50 dev celery[10046]: response = receiver(signal=self, sender=sender, **named)
Mar 24 17:25:50 dev celery[10046]: File "/home/vagrant/devel/pulp/server/pulp/server/async/app.py", line 194, in shutdown_worker
Mar 24 17:25:50 dev celery[10046]: tasks._delete_worker(sender.hostname, normal_shutdown=True)
Mar 24 17:25:50 dev celery[10046]: File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 273, in _delete_worker
Mar 24 17:25:50 dev celery[10046]: cancel(task_status['task_id'])
Mar 24 17:25:50 dev celery[10046]: File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 642, in cancel
Mar 24 17:25:50 dev celery[10046]: controller.revoke(task_id, terminate=True)
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 172, in revoke
Mar 24 17:25:50 dev celery[10046]: 'signal': signal}, **kwargs)
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 316, in broadcast
Mar 24 17:25:50 dev celery[10046]: limit, callback, channel=channel,
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/kombu/pidbox.py", line 283, in _broadcast
Mar 24 17:25:50 dev celery[10046]: chan = channel or self.connection.default_channel
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel
Mar 24 17:25:50 dev celery[10046]: self.connection
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection
Mar 24 17:25:50 dev celery[10046]: self._connection = self._establish_connection()
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection
Mar 24 17:25:50 dev celery[10046]: conn = self.transport.establish_connection()
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1600, in establish_connection
Mar 24 17:25:50 dev celery[10046]: conn = self.Connection(**opts)
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1261, in __init__
Mar 24 17:25:50 dev celery[10046]: self._qpid_conn = establish(**self.connection_options)
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 112, in establish
Mar 24 17:25:50 dev celery[10046]: conn.open(timeout=timeout)
Mar 24 17:25:50 dev celery[10046]: File "<string>", line 6, in open
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 323, in open
Mar 24 17:25:50 dev celery[10046]: self.attach(timeout=timeout)
Mar 24 17:25:50 dev celery[10046]: File "<string>", line 6, in attach
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 341, in attach
Mar 24 17:25:50 dev celery[10046]: if not self._ewait(lambda: self._transport_connected and not self._unlinked(), timeout=timeout):
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 274, in _ewait
Mar 24 17:25:50 dev celery[10046]: self.check_error()
Mar 24 17:25:50 dev celery[10046]: File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 267, in check_error
Mar 24 17:25:50 dev celery[10046]: raise e
Mar 24 17:25:50 dev celery[10046]: ConnectError: [Errno 111] Connection refused
Then start qpidd again: sudo systemctl start qpidd
Note that the pulp_workers never started. They emit the traceback and shut back down.
Updated by dalley over 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dalley
Updated by bmbouter over 7 years ago
- Description updated (diff)
The impact statement of this bug claims that the workers are left in a running-but-permanently-broken state, ending with "The workers are in a dirty state". Based on the recent testing we did, I think that's incorrect. What actually happens is that the traceback is emitted and the workers fail to start. That isn't great, but it is less severe than having a worker that is running but "broken".
I'm also removing the language suggesting this root cause is related to Issue #2613, because that issue is closed and fixed on the version we are testing against, yet we are still experiencing the same problem.
Updated by dalley over 7 years ago
- Status changed from ASSIGNED to NEW
- Sprint/Milestone deleted (41)
After discussion with bmbouter, we decided to remove this from the sprint. It's not very important, it's an upstream issue with an unresponsive upstream, and it has been in stasis for 3 weeks already while we tried to get more information.
Updated by bmbouter over 7 years ago
In some cases, if the broker is down, a worker will be unable to start. Specifically, if the worker was shut down while processing tasks, it will try to cancel those tasks on startup. If that first attempt fails, it shuts back down.
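For context, here is a minimal sketch (not Pulp's actual code; the app object and function name are illustrative) of the cleanup path the traceback shows. Cancelling a dead worker's leftover tasks goes through app.control.revoke(), which is a broadcast over the broker connection, so with qpidd down it fails immediately:

# Illustrative sketch only -- not the code in pulp.server.async.tasks.
# It shows why cancelling leftover tasks at startup needs a reachable broker.
from celery import Celery

app = Celery(broker='qpid://localhost:5672//')  # hypothetical broker URL


def cleanup_dead_worker(task_ids):
    """Cancel tasks that were still assigned to a worker that is no longer alive."""
    for task_id in task_ids:
        # revoke() is broadcast over the broker connection; with qpidd stopped
        # it raises ConnectError: [Errno 111] Connection refused, and the
        # worker shuts back down instead of finishing startup.
        app.control.revoke(task_id, terminate=True)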
Upstream has outlined the fix they would accept, and it is very straightforward. It would add a new option to Celery's signal.connect called retry which, if we set it to True, would cause automatic retry while the broker is down (an illustrative sketch of that behavior follows the list below). If we did this, the next steps would be:
1. Make the upstream patch providing the 'retry' option to signal.connect and get it merged to the upstream 4.0 branch
2. Backport the patch to the 3.1 branch, which should be a simple cherry-pick without modification
3. Update/patch the Fedora versions and the version that we build in the deps/ folder
4. Add retry=True to these two signals: https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L116-L183
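A rough sketch of the retry behavior the proposed option is meant to provide. This is an assumption for illustration only, not the upstream Celery implementation; the wrapper name and retry policy are invented here:

# Hypothetical sketch only -- not the upstream Celery change. It shows the idea
# of retrying a connected signal receiver when the broker is unreachable,
# instead of letting the worker process die on the first failure.
import functools
import logging
import time

_logger = logging.getLogger(__name__)


def retry_on_broker_error(receiver, attempts=10, delay=3):
    """Wrap a signal receiver so broker connection errors trigger a retry."""
    @functools.wraps(receiver)
    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return receiver(*args, **kwargs)
            except Exception as exc:  # e.g. qpid ConnectError while qpidd is down
                if attempt == attempts:
                    raise
                _logger.warning('Broker unreachable in %s (attempt %d/%d): %s',
                                receiver.__name__, attempt, attempts, exc)
                time.sleep(delay)
    return wrapper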
Updated by bmbouter over 7 years ago
- Sprint/Milestone set to 41
Adding back to sprint after IRC discussion and input from upstream.
Updated by daviddavis over 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
Updated by daviddavis over 7 years ago
I've been trying to reproduce this all day, but it seems like the workers do start for me when I restart them via systemctl:
$ sudo systemctl restart pulp_workers
$ ps aux | grep reserved_resource_worker
apache 10506 0.9 4.5 931728 92784 ? Ssl 18:21 0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid
apache 10509 0.9 4.5 931496 92292 ? Ssl 18:21 0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid
apache 10514 0.9 4.4 931144 92120 ? Ssl 18:21 0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid
apache 10519 0.9 4.5 930456 92400 ? Ssl 18:21 0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid
apache 10557 0.0 3.5 558436 72824 ? S 18:21 0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid
apache 10560 0.0 3.5 558084 72680 ? S 18:21 0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid
apache 10561 0.0 3.5 557396 72800 ? S 18:21 0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid
apache 10570 0.0 3.6 636800 73956 ? S 18:21 0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid
vagrant 10603 0.0 0.0 12792 948 pts/0 S+ 18:25 0:00 grep --color=auto reserved_resource_worker
$ sudo systemctl status pulp_workers
● pulp_workers.service - Pulp Celery Workers
Loaded: loaded (/home/vagrant/devel/pulp/server/usr/lib/systemd/system/pulp_workers.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2017-07-20 18:21:38 UTC; 4min 3s ago
Process: 10495 ExecStop=/usr/bin/python -m pulp.server.async.manage_workers stop (code=exited, status=0/SUCCESS)
Process: 10502 ExecStart=/usr/bin/python -m pulp.server.async.manage_workers start (code=exited, status=0/SUCCESS)
Main PID: 10502 (code=exited, status=0/SUCCESS)
Jul 20 18:21:38 pulp2.dev systemd[1]: Starting Pulp Celery Workers...
Jul 20 18:21:38 pulp2.dev systemd[1]: Started Pulp Celery Workers.
However, when I look at the pulp-admin status output:
$ pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: scheduler@pulp2.dev
_ns: workers
Last Heartbeat: 2017-07-20T18:28:16Z
Messaging Connection:
Connected: False
Versions:
Platform Version: 2.14a1
When I start up qpidd, I see the workers in the status output. But the task I've started looks stuck:
$ sudo systemctl start qpidd
$ pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: scheduler@pulp2.dev
_ns: workers
Last Heartbeat: 2017-07-20T18:30:27Z
_id: resource_manager@pulp2.dev
_ns: workers
Last Heartbeat: 2017-07-20T18:30:29Z
_id: reserved_resource_worker-1@pulp2.dev
_ns: workers
Last Heartbeat: 2017-07-20T18:30:26Z
_id: reserved_resource_worker-3@pulp2.dev
_ns: workers
Last Heartbeat: 2017-07-20T18:30:26Z
_id: reserved_resource_worker-0@pulp2.dev
_ns: workers
Last Heartbeat: 2017-07-20T18:30:26Z
_id: reserved_resource_worker-2@pulp2.dev
_ns: workers
Last Heartbeat: 2017-07-20T18:30:26Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.14a1
$ pulp-admin tasks list
+----------------------------------------------------------------------+
Tasks
+----------------------------------------------------------------------+
Operations: sync
Resources: zoo (repository)
State: Running
Start Time: 2017-07-20T18:20:51Z
Finish Time: Incomplete
Task Id: 3af0b6c2-e0ed-48c8-9919-a0f5fe0cf832
When I restart the pulp_workers via systemctl, the task goes away.
I am on celery 4.0.2.
Updated by daviddavis over 7 years ago
The stuck task behavior might be related to:
Updated by daviddavis over 7 years ago
I'm having better luck reproducing this on celery 3. I see the traceback on celery 3 (I actually saw it on celery 4 too), but the parent and child processes for one of the workers are missing.
Updated by bmbouter over 7 years ago
daviddavis@redhat.com wrote:
I've been trying to reproduce all day but it seems like the workers do start for me when I restart them via systemctl:
I believe the reason systemctl looks started even though it isn't is due to https://pulp.plan.io/issues/2837. The unit systemd started (the chain loader) did start correctly, even if the chain-loaded worker units didn't.
Updated by daviddavis over 7 years ago
Upstream PR against celery 4:
https://github.com/celery/celery/pull/4192
After that gets merged, we need to:
- Backport the patch against celery 3
- Enable the setting in pulp 2 and 3
Updated by bmbouter over 7 years ago
Is the plan to backport to Celery 3.1 in upstream and then apply that upstream-merged patch onto the EL7 and F25? Those are the only distros [0] which run Celery 3.1. Starting with F26 it's Celery 4. If we aren't going to perform the patching we shouldn't backport to Celery 3.1 because Celery 3.1 is at its terminal release and won't release again.
Similar question for Celery 4, as soon as Pulp expects that feature we need to ensure that the upstream-merged patch is applied to F26 and Rawhide.
Updated by daviddavis over 7 years ago
Talking with the team, we agreed not to fix this for Celery 3 since there seems to be no urgency from the stakeholders on this. Also, there is a workaround, and this problem shouldn't lead to data loss.
For Celery 4, we're going to wait for a new release, so I am unsetting the sprint and moving this to the backlog. Here's the diff we need to enable this in pulp 2:
diff --git a/server/pulp/server/async/app.py b/server/pulp/server/async/app.py
index c0c577266..76b8cb44e 100644
--- a/server/pulp/server/async/app.py
+++ b/server/pulp/server/async/app.py
@@ -113,7 +113,7 @@ class HeartbeatStep(bootsteps.StartStopStep):
celery.steps['consumer'].add(HeartbeatStep)
-@celeryd_after_setup.connect
+@celeryd_after_setup.connect(retry=True)
def initialize_worker(sender, instance, **kwargs):
"""
This function performs all the necessary initialization of the Celery worker.
@@ -167,7 +167,7 @@ def initialize_worker(sender, instance, **kwargs):
get_resource_manager_lock(sender)
-@worker_shutdown.connect
+@worker_shutdown.connect(retry=True)
def shutdown_worker(signal, sender, **kwargs):
"""
Called when a worker is shutdown.
For pulp 3:
diff --git a/platform/pulpcore/tasking/celery_app.py b/platform/pulpcore/tasking/celery_app.py
index 092d4bb28..e08414a87 100644
--- a/platform/pulpcore/tasking/celery_app.py
+++ b/platform/pulpcore/tasking/celery_app.py
@@ -115,7 +115,7 @@ class HeartbeatStep(bootsteps.StartStopStep):
celery.steps['consumer'].add(HeartbeatStep)
-@celeryd_after_setup.connect
+@celeryd_after_setup.connect(retry=True)
def initialize_worker(sender, instance, **kwargs):
"""
This function performs all the necessary initialization of the Celery worker.
@@ -169,7 +169,7 @@ def initialize_worker(sender, instance, **kwargs):
get_resource_manager_lock(sender)
-@worker_shutdown.connect
+@worker_shutdown.connect(retry=True)
def shutdown_worker(signal, sender, **kwargs):
"""
Called when a worker is shutdown.
Updated by daviddavis over 7 years ago
- Status changed from ASSIGNED to NEW
- Assignee deleted (daviddavis)
- Sprint/Milestone deleted (43)
Updated by daviddavis over 4 years ago
- Status changed from NEW to CLOSED - WONTFIX
- Sprint Candidate set to No