Project

Profile

Help

Issue #2664

Workers Canceling Tasks on Startup Fail if the Broker is Down

Added by bmbouter over 2 years ago. Updated 6 months ago.

Status:
NEW
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Severity:
2. Medium
Version:
Platform Release:
Blocks Release:
OS:
Backwards Incompatible:
No
Triaged:
Yes
Groomed:
Yes
Sprint Candidate:
Tags:
Pulp 2
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:

Description

To reproduce:

Create a 'zoo' repo that syncs from the zoo fixture repo

pulp-admin rpm repo create --repo-id tester --feed https://repos.fedorapeople.org/repos/pulp/pulp/fixtures/rpm/

Sync 'zoo' manually once to see it working. Sync it again, and just when the task begins run:

sudo pkill -9 -f reserved_resource_worker

Then stop qpidd with sudo systemctl stop qpidd

Then restart the pulp_workers sudo systemctl restart pulp_workers

Observe the traceback:

Mar 24 17:25:50 dev pulp[10046]: pulp.server.async.tasks:INFO: Cleaning up shutdown worker 'reserved_resource_worker-0@dev'.
Mar 24 17:25:50 dev celery[10046]: Traceback (most recent call last):
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib64/python2.7/multiprocessing/util.py", line 274, in _run_finalizers
Mar 24 17:25:50 dev celery[10046]:     finalizer()
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib64/python2.7/multiprocessing/util.py", line 207, in __call__
Mar 24 17:25:50 dev celery[10046]:     res = self._callback(*self._args, **self._kwargs)
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/celery/worker/__init__.py", line 201, in _send_worker_shutdown
Mar 24 17:25:50 dev celery[10046]:     signals.worker_shutdown.send(sender=self)
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 166, in send
Mar 24 17:25:50 dev celery[10046]:     response = receiver(signal=self, sender=sender, **named)
Mar 24 17:25:50 dev celery[10046]:   File "/home/vagrant/devel/pulp/server/pulp/server/async/app.py", line 194, in shutdown_worker
Mar 24 17:25:50 dev celery[10046]:     tasks._delete_worker(sender.hostname, normal_shutdown=True)
Mar 24 17:25:50 dev celery[10046]:   File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 273, in _delete_worker
Mar 24 17:25:50 dev celery[10046]:     cancel(task_status['task_id'])
Mar 24 17:25:50 dev celery[10046]:   File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 642, in cancel
Mar 24 17:25:50 dev celery[10046]:     controller.revoke(task_id, terminate=True)
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 172, in revoke
Mar 24 17:25:50 dev celery[10046]:     'signal': signal}, **kwargs)
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/celery/app/control.py", line 316, in broadcast
Mar 24 17:25:50 dev celery[10046]:     limit, callback, channel=channel,
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/kombu/pidbox.py", line 283, in _broadcast
Mar 24 17:25:50 dev celery[10046]:     chan = channel or self.connection.default_channel
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel
Mar 24 17:25:50 dev celery[10046]:     self.connection
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection
Mar 24 17:25:50 dev celery[10046]:     self._connection = self._establish_connection()
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection
Mar 24 17:25:50 dev celery[10046]:     conn = self.transport.establish_connection()
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1600, in establish_connection
Mar 24 17:25:50 dev celery[10046]:     conn = self.Connection(**opts)
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 1261, in __init__
Mar 24 17:25:50 dev celery[10046]:     self._qpid_conn = establish(**self.connection_options)
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 112, in establish
Mar 24 17:25:50 dev celery[10046]:     conn.open(timeout=timeout)
Mar 24 17:25:50 dev celery[10046]:   File "<string>", line 6, in open
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 323, in open
Mar 24 17:25:50 dev celery[10046]:     self.attach(timeout=timeout)
Mar 24 17:25:50 dev celery[10046]:   File "<string>", line 6, in attach
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 341, in attach
Mar 24 17:25:50 dev celery[10046]:     if not self._ewait(lambda: self._transport_connected and not self._unlinked(), timeout=timeout):
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 274, in _ewait
Mar 24 17:25:50 dev celery[10046]:     self.check_error()
Mar 24 17:25:50 dev celery[10046]:   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 267, in check_error
Mar 24 17:25:50 dev celery[10046]:     raise e
Mar 24 17:25:50 dev celery[10046]: ConnectError: [Errno 111] Connection refused

Then start qpidd again sudo systemctl start qpidd

Note that the pulp_workers never started. They emit the traceback and shut back down.

History

#2 Updated by ttereshc over 2 years ago

  • Triaged changed from No to Yes

#3 Updated by dalley over 2 years ago

Now that #2613 is fixed, I can conclude that #2613 is not the cause of this issue. I was able to reproduce it with the fix applied.

#4 Updated by bmbouter over 2 years ago

  • Groomed changed from No to Yes

#5 Updated by ttereshc over 2 years ago

  • Sprint Candidate changed from No to Yes

#7 Updated by mhrivnak over 2 years ago

  • Sprint/Milestone set to 40

#8 Updated by dalley over 2 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley

#9 Updated by bmbouter over 2 years ago

  • Description updated (diff)

The impact of this bug claims that the workers are left in a running-but-permamently-broken state with the claim in the last sentence "The workers are in a dirty state". Based on the recent testing we did I think that's incorrect. I think actually: the traceback is emitted but the workers actually fail to start. That isn't great but it is less severe than having a worker that is running but "broken".

Also I'm removing the language suggesting this root cause is realted to Issue #2613 because that is closed+fixed on the version we are testing against, but we are still experiencing the same issue.

#10 Updated by bmbouter over 2 years ago

  • Description updated (diff)

#11 Updated by mhrivnak over 2 years ago

  • Sprint/Milestone changed from 40 to 41

#12 Updated by dalley over 2 years ago

  • Status changed from ASSIGNED to NEW
  • Sprint/Milestone deleted (41)

After discussion with @bmbouter, we decided to remove from the sprint. It's not very important, it's an upstream issue with an unresponsive upstream, and it's been in stasis for 3 weeks already while trying to get more info.

#13 Updated by dalley over 2 years ago

  • Private changed from No to Yes

#14 Updated by dalley over 2 years ago

  • Private changed from Yes to No

#16 Updated by dalley over 2 years ago

  • Assignee deleted (dalley)

#17 Updated by bmbouter about 2 years ago

In some cases, if the broker is down a worker will be unable to start. Specifically if the worker was shutdown while processing tasks it will try to cancel those tasks on startup. If the first try fails it shuts back down.

Upstream has outlined the fix they would accept and it is very straightforward. It would add a new option to celery's signal.connect called retry which if we set to True would cause automatic retry while the broker is down. If we did this the next steps would be:

1. Make the upstream patch providing the 'retry' option to signal.connect and get it merged to the upstream 4.0 branch
2. Backport the issue to the 3.1 branch which should be a simple cherry pick without modification
3. update/patch the Fedora versions and the version that we build in the deps/ folder
4. Add retry=True to these two signals: https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L116-L183

#18 Updated by bmbouter about 2 years ago

  • Sprint/Milestone set to 41

Adding back to sprint after IRC discussion and input from upstream.

#19 Updated by daviddavis about 2 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis

#20 Updated by daviddavis about 2 years ago

I've been trying to reproduce all day but it seems like the workers do start for me when I restart them via systemctl:

$ sudo systemctl restart pulp_workers
$ ps aux | grep reserved_resource_worker
apache   10506  0.9  4.5 931728 92784 ?        Ssl  18:21   0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid
apache   10509  0.9  4.5 931496 92292 ?        Ssl  18:21   0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid
apache   10514  0.9  4.4 931144 92120 ?        Ssl  18:21   0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid
apache   10519  0.9  4.5 930456 92400 ?        Ssl  18:21   0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid
apache   10557  0.0  3.5 558436 72824 ?        S    18:21   0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid
apache   10560  0.0  3.5 558084 72680 ?        S    18:21   0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-2@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-2.pid
apache   10561  0.0  3.5 557396 72800 ?        S    18:21   0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-3@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-3.pid
apache   10570  0.0  3.6 636800 73956 ?        S    18:21   0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid
vagrant  10603  0.0  0.0  12792   948 pts/0    S+   18:25   0:00 grep --color=auto reserved_resource_worker

$ sudo systemctl status pulp_workers                                                                                                                                                 
‚óŹ pulp_workers.service - Pulp Celery Workers
   Loaded: loaded (/home/vagrant/devel/pulp/server/usr/lib/systemd/system/pulp_workers.service; enabled; vendor preset: disabled)
   Active: active (exited) since Thu 2017-07-20 18:21:38 UTC; 4min 3s ago
  Process: 10495 ExecStop=/usr/bin/python -m pulp.server.async.manage_workers stop (code=exited, status=0/SUCCESS)
  Process: 10502 ExecStart=/usr/bin/python -m pulp.server.async.manage_workers start (code=exited, status=0/SUCCESS)
 Main PID: 10502 (code=exited, status=0/SUCCESS)

Jul 20 18:21:38 pulp2.dev systemd[1]: Starting Pulp Celery Workers...
Jul 20 18:21:38 pulp2.dev systemd[1]: Started Pulp Celery Workers.

Although when I look at the status output:

$ pulp-admin status
+----------------------------------------------------------------------+
                          Status of the server
+----------------------------------------------------------------------+

Api Version:           2
Database Connection:   
  Connected: True
Known Workers:         
  _id:            scheduler@pulp2.dev
  _ns:            workers
  Last Heartbeat: 2017-07-20T18:28:16Z
Messaging Connection:  
  Connected: False
Versions:              
  Platform Version: 2.14a1

When I start up qpidd, I see the workers in the status output. But the task I've started looks stuck:

$ sudo systemctl start qpidd
$ pulp-admin status
+----------------------------------------------------------------------+
                          Status of the server
+----------------------------------------------------------------------+

Api Version:           2
Database Connection:   
  Connected: True
Known Workers:         
  _id:            scheduler@pulp2.dev
  _ns:            workers
  Last Heartbeat: 2017-07-20T18:30:27Z
  _id:            resource_manager@pulp2.dev
  _ns:            workers
  Last Heartbeat: 2017-07-20T18:30:29Z
  _id:            reserved_resource_worker-1@pulp2.dev
  _ns:            workers
  Last Heartbeat: 2017-07-20T18:30:26Z
  _id:            reserved_resource_worker-3@pulp2.dev
  _ns:            workers
  Last Heartbeat: 2017-07-20T18:30:26Z
  _id:            reserved_resource_worker-0@pulp2.dev
  _ns:            workers
  Last Heartbeat: 2017-07-20T18:30:26Z
  _id:            reserved_resource_worker-2@pulp2.dev
  _ns:            workers
  Last Heartbeat: 2017-07-20T18:30:26Z
Messaging Connection:  
  Connected: True
Versions:              
  Platform Version: 2.14a1

$ pulp-admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  sync
Resources:   zoo (repository)
State:       Running
Start Time:  2017-07-20T18:20:51Z
Finish Time: Incomplete
Task Id:     3af0b6c2-e0ed-48c8-9919-a0f5fe0cf832

When I restart the pulp_workers via systemctl, then the task goes away.

I am on celery 4.0.2.

#21 Updated by daviddavis about 2 years ago

The stuck task behavior might be related to:

https://pulp.plan.io/issues/2835

#22 Updated by daviddavis about 2 years ago

I'm having better luck on celery 3. I see the backtrace on celery 3 (I saw it too on celery 4 actually) but the parent + child process for one of the workers is missing.

#23 Updated by mhrivnak about 2 years ago

  • Sprint/Milestone changed from 41 to 42

#25 Updated by bmbouter about 2 years ago

wrote:

I've been trying to reproduce all day but it seems like the workers do start for me when I restart them via systemctl:

I believe the reason systemctl looks started even though it isn't is due to https://pulp.plan.io/issues/2837 The unit systemd started (the chain loader) did start correctly, even if the chain-loaded ones didn't.

#26 Updated by mhrivnak about 2 years ago

  • Sprint/Milestone changed from 42 to 43

#27 Updated by daviddavis about 2 years ago

Upstream PR against celery 4:

https://github.com/celery/celery/pull/4192

After that gets merged, need to:

  • Backport the patch against celery 3
  • Enable the setting in pulp 2 and 3

#28 Updated by bmbouter about 2 years ago

Is the plan to backport to Celery 3.1 in upstream and then apply that upstream-merged patch onto the EL7 and F25? Those are the only distros [0] which run Celery 3.1. Starting with F26 it's Celery 4. If we aren't going to perform the patching we shouldn't backport to Celery 3.1 because Celery 3.1 is at its terminal release and won't release again.

Similar question for Celery 4, as soon as Pulp expects that feature we need to ensure that the upstream-merged patch is applied to F26 and Rawhide.

[0]: https://apps.fedoraproject.org/packages/python-celery/

#29 Updated by daviddavis about 2 years ago

Talking with the team, we agreed to not fix Celery 3 since it seems like there was no urgency from the stackholders on this. Also, there is a workaround and this problem shouldn't lead to data loss.

For Celery 4, we're going to wait for a new release so I am unsetting the sprint and moving this to the backlog. Here's a diff that we need to enable this in pulp 2:


diff --git a/server/pulp/server/async/app.py b/server/pulp/server/async/app.py
index c0c577266..76b8cb44e 100644
--- a/server/pulp/server/async/app.py
+++ b/server/pulp/server/async/app.py
@@ -113,7 +113,7 @@ class HeartbeatStep(bootsteps.StartStopStep):
 celery.steps['consumer'].add(HeartbeatStep)

-@celeryd_after_setup.connect
+@celeryd_after_setup.connect(retry=True)
 def initialize_worker(sender, instance, **kwargs):
     """ 
     This function performs all the necessary initialization of the Celery worker.
@@ -167,7 +167,7 @@ def initialize_worker(sender, instance, **kwargs):
         get_resource_manager_lock(sender)

-@worker_shutdown.connect
+@worker_shutdown.connect(retry=True)
 def shutdown_worker(signal, sender, **kwargs):
     """ 
     Called when a worker is shutdown.

For pulp 3:

diff --git a/platform/pulpcore/tasking/celery_app.py b/platform/pulpcore/tasking/celery_app.py
index 092d4bb28..e08414a87 100644
--- a/platform/pulpcore/tasking/celery_app.py
+++ b/platform/pulpcore/tasking/celery_app.py
@@ -115,7 +115,7 @@ class HeartbeatStep(bootsteps.StartStopStep):
 celery.steps['consumer'].add(HeartbeatStep)

-@celeryd_after_setup.connect
+@celeryd_after_setup.connect(retry=True)
 def initialize_worker(sender, instance, **kwargs):
     """ 
     This function performs all the necessary initialization of the Celery worker.
@@ -169,7 +169,7 @@ def initialize_worker(sender, instance, **kwargs):
         get_resource_manager_lock(sender)

-@worker_shutdown.connect
+@worker_shutdown.connect(retry=True)
 def shutdown_worker(signal, sender, **kwargs):
     """ 
     Called when a worker is shutdown.

#30 Updated by daviddavis about 2 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (daviddavis)
  • Sprint/Milestone deleted (43)

#31 Updated by rchan over 1 year ago

  • Sprint Candidate deleted (Yes)

#32 Updated by bmbouter 6 months ago

  • Tags Pulp 2 added

Please register to edit this issue

Also available in: Atom PDF