Issue #1801
closed
Pulp celery_beat and resource_manager are running, but logs say they are not running
Description
After some unknown amount of time Pulp infrastructure processes appear to die and we receive these messages in the journal / logs:
pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running. Pulp will not operate correctly without at least one pulp_resource_mananger process running.
pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.
A restart resolves the issue, but restarting shouldn't be required for normal operation.
Updated by bmbouter almost 7 years ago
I reproduced this in my environment, and pulp_celerybeat appears to be deadlocking in the kombu transport. A gdb trace of a deadlocked pulp_celerybeat process shows that the thread which processes event callbacks for incoming heartbeat messages is halted at line 1438 (marked with ">"). See the GDB py-list output:
Thread 5 (Thread 0x7f737da33700 (LWP 6551)):
 1433                    'The Python package "qpid.messaging" is missing. Install it '
 1434                    'with your package manager. You can also try `pip install '
 1435                    'qpid-python`.')
 1436
 1437        def _qpid_message_ready_handler(self, session):
>1438            os.write(self._w, '0')
 1439
 1440        def _qpid_async_exception_notify_handler(self, obj_with_exception, exc):
 1441            os.write(self._w, 'e')
 1442
 1443        def on_readable(self, connection, loop):
That line corresponds with this line in the kombu code: https://github.com/celery/kombu/blob/93f6606e0a758c9cffb9b3c2ef6a239ed7027309/kombu/transport/qpid.py#L1474
That os.write call is the point of deadlock. I don't yet understand why it is deadlocking, but it is likely a thread safety issue around that pipe. The investigation continues.
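As a general illustration (not a diagnosis of this bug), one way a blocking os.write on a pipe can stall forever is when nothing drains the read end and the kernel buffer fills up. The sketch below is written for Python 3 rather than the Python 2 code in the trace, and the function name fill_pipe_nonblocking is hypothetical. It puts the write end in non-blocking mode so the script reports the limit instead of hanging. The number of bytes accepted depends on the platform.

```python
import os


def fill_pipe_nonblocking():
    """Write single bytes to an undrained pipe until the kernel refuses more.

    Returns the number of bytes the pipe accepted before it was full.
    """
    r, w = os.pipe()
    # Non-blocking mode is only for the demo: it turns "would block forever"
    # into a BlockingIOError that we can observe.
    os.set_blocking(w, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(w, b"0")
            except BlockingIOError:
                # With a blocking descriptor, os.write would stall here
                # until some reader drained the pipe.
                return written
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print("bytes accepted before the pipe was full:", fill_pipe_nonblocking())
```

If a writer thread behaves like this with a blocking descriptor and no reader, it would stop at exactly the os.write call, which is the symptom seen in the trace. Whether that is what happens here is not established by this comment.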
Updated by bmbouter almost 7 years ago
The root cause is identified, and I filed it in the Kombu upstream issue tracker. https://github.com/celery/kombu/issues/577
I'll be fixing it upstream and then we'll cherry pick that commit as a patch to the version of python-kombu that Pulp carries along with the version in Rawhide.
Updated by rbarlow almost 7 years ago
On Thursday, March 31, 2016 9:17:01 PM EDT you wrote:
> I'll be fixing it upstream and then we'll cherry pick that commit as a patch to the version of python-kombu that Pulp carries along with the version in Rawhide.
Consider trying to get the patch into Fedora 24 as well so we don't have this problem there. Thanks!
Updated by bmbouter almost 7 years ago
rbarlow wrote:
> On Thursday, March 31, 2016 9:17:01 PM EDT you wrote:
>> I'll be fixing it upstream and then we'll cherry pick that commit as a patch to the version of python-kombu that Pulp carries along with the version in Rawhide.
> Consider trying to get the patch into Fedora 24 as well so we don't have this problem there. Thanks!
Oh yes I will do this. I forgot Fedora 24 had branched. I'll submit the update to both Rawhide and F24.
Updated by bmbouter almost 7 years ago
This commit needs to be cherry-picked into the version we carry: https://github.com/celery/kombu/commit/277309f47a713a31885248b78df45e41d8d5e490
This regression was introduced with kombu 3.0.33. This fix needs to be on pulp-dev and newer branches. No existing 2.7 users use 3.0.33 so we can fix it in 2.7-dev and not have to make a new 2.7 release to make the fix available to existing users. The fix will be included with 2.8.2 from the merge forward to master.
Updated by bmbouter almost 7 years ago
- Status changed from ASSIGNED to POST
Added by bmbouter almost 7 years ago
Adds patch to python-kombu to fix pulp_celerybeat deadlock
Updated by pthomas@redhat.com almost 7 years ago
Before updating kombu
[root@ibm-x3550m3-12 ~]# rpm -qa |grep kombu
python-kombu-3.0.33-4.pulp.el7.noarch
[root@ibm-x3550m3-12 ~]#
[root@ibm-x3550m3-12 ~]# sudo qpid-stat -q |grep celeryev
celeryev.223a4cfb-e1bd-4f6e-b146-0198d295e33a Y 20.4k 86.0k 65.5k 18.0m 75.5m 57.6m 1 2
[root@ibm-x3550m3-12 ~]# journalctl -f -l
-- Logs begin at Mon 2016-04-04 21:51:34 CEST. --
Apr 05 13:55:02 ibm-x3550m3-12.lab.eng.brq.redhat.com pulp[32000]: pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running. Pulp will not operate correctly without at least one pulp_resource_mananger process running.
Apr 05 13:55:02 ibm-x3550m3-12.lab.eng.brq.redhat.com pulp[32000]: pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.
Updated by bmbouter almost 7 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|c54adba554d157a3ad6fef9ba11d2c0e01595ac7.
Updated by pthomas@redhat.com almost 7 years ago
Verified that msgIn and msgOut are the same and that msgOut no longer stops after 65k.
[root@pulp-el7 ~]# rpm -qa |grep kombu
python-kombu-3.0.33-5.pulp.el7.noarch
[root@pulp-el7 ~]# sudo qpid-stat -q |grep celeryev
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================================================
celeryev.9631492f-a29e-4bdc-b843-23911d505f2d Y 0 145k 145k 0 128m 128m 1 2
[root@pulp-el6 ~]# rpm -qa |grep kombu
python-kombu-3.0.33-5.pulp.el6.noarch
[root@pulp-el6 ~]#
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================================================
celeryev.0caa15b8-8829-441f-8ed2-231cd34a94dd Y 0 156k 156k 0 142m 142m 1 2
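The check above can be sketched in code. This is a rough helper, not part of Pulp or the qpid tools: the function names are hypothetical, it assumes the eight right-most columns of a `qpid-stat -q` row are msg, msgIn, msgOut, bytes, bytesIn, bytesOut, cons and bind (as in the header shown), and it assumes the "k" and "m" suffixes are decimal multipliers. Because qpid-stat rounds the displayed values, equality here only means "equal at display precision".

```python
def parse_count(text):
    """Convert an abbreviated count such as '145k' or '128m' to a number.

    Assumption: 'k', 'm' and 'g' are decimal multipliers.
    """
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    suffix = text[-1].lower()
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)


def queue_counters(row):
    """Read the eight numeric columns from the right of a queue row.

    Reading from the right avoids guessing which flag column a 'Y' belongs to.
    """
    names = ["msg", "msgIn", "msgOut", "bytes",
             "bytesIn", "bytesOut", "cons", "bind"]
    values = row.split()[-len(names):]
    return {name: parse_count(value) for name, value in zip(names, values)}


def is_drained(row):
    """True when no messages are queued and msgIn matches msgOut."""
    counters = queue_counters(row)
    return counters["msg"] == 0 and counters["msgIn"] == counters["msgOut"]


if __name__ == "__main__":
    before = ("celeryev.223a4cfb-e1bd-4f6e-b146-0198d295e33a "
              "Y 20.4k 86.0k 65.5k 18.0m 75.5m 57.6m 1 2")
    after = ("celeryev.9631492f-a29e-4bdc-b843-23911d505f2d "
             "Y 0 145k 145k 0 128m 128m 1 2")
    print("before update drained:", is_drained(before))
    print("after update drained:", is_drained(after))
```

Run against the rows pasted in this issue, the row from before the update shows a backlog (msgOut lags msgIn), while the rows from after the update show an empty queue.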
Updated by bmbouter almost 7 years ago
The patch has been applied in rawhide and is currently available.
I've submitted an update to F24 also here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-ec038bbf19
Updated by semyers almost 7 years ago
- Platform Release changed from 2.8.2 to 2.8.3
Updated by bmbouter almost 7 years ago
pulp-list e-mail about the issue: https://www.redhat.com/archives/pulp-list/2016-April/msg00020.html
Updated by pthomas@redhat.com almost 7 years ago
- Status changed from 5 to 6
Updated by semyers almost 7 years ago
- Status changed from 6 to CLOSED - CURRENTRELEASE
Adds patch to python-kombu to fix pulp_celerybeat deadlock
closes #1801 https://pulp.plan.io/issues/1801