Issue #1801

Pulp celery_beat and resource_manager are running, but logs say they are not running

Added by bmbouter over 3 years ago. Updated 8 months ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee:
Category: -
Sprint/Milestone: -
Start date:
Due date:
Severity: 3. High
Version: 2.8.0
Platform Release: 2.8.3
Blocks Release:
OS:
Backwards Incompatible: No
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
QA Contact:
Complexity:
Smash Test:
Verified: Yes
Verification Required: No
Sprint: Sprint 1

Description

After some unknown amount of time Pulp infrastructure processes appear to die and we receive these messages in the journal / logs:

pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running. Pulp will not operate correctly without at least one pulp_resource_mananger process running.
pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.

A restart resolves the issue, but restarting shouldn't be required for normal operation.

Associated revisions

Revision c54adba5 View on GitHub
Added by bmbouter over 3 years ago

Adds patch to python-kombu to fix pulp_celerybeat deadlock

closes #1801
https://pulp.plan.io/issues/1801

History

#1 Updated by mhrivnak over 3 years ago

  • Sprint/Milestone set to 19

#2 Updated by bmbouter over 3 years ago

I reproduced this in my environment, and pulp_celerybeat appears to be deadlocking in the kombu transport. A gdb trace of a deadlocked pulp_celerybeat process shows the thread which processes event callbacks of incoming heartbeat messages is halted at this line. See the GDB py-list output:

Thread 5 (Thread 0x7f737da33700 (LWP 6551)):
1433                    'The Python package "qpid.messaging" is missing. Install it '
1434                    'with your package manager. You can also try `pip install '
1435                    'qpid-python`.')
1436    
1437        def _qpid_message_ready_handler(self, session):
>1438            os.write(self._w, '0')
1439    
1440        def _qpid_async_exception_notify_handler(self, obj_with_exception, exc):
1441            os.write(self._w, 'e')
1442    
1443        def on_readable(self, connection, loop):

That line corresponds with this line in the kombu code: https://github.com/celery/kombu/blob/93f6606e0a758c9cffb9b3c2ef6a239ed7027309/kombu/transport/qpid.py#L1474

That os.write call is the point of deadlock. I don't yet understand why it is deadlocking, but it is likely a thread safety issue around that pipe. The investigation continues.
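As background on how a one-byte os.write to a pipe can hang at all: a pipe has a finite kernel buffer, and a blocking write waits once that buffer is full and nothing is reading from the other end. The following is a minimal Python 3 sketch of that behaviour on a POSIX system, not kombu's code. The helper name `count_bytes_until_pipe_full` is hypothetical, and the write end is switched to non-blocking mode only so that the demonstration stops instead of hanging. Whether this is the actual mechanism in pulp_celerybeat is not established at this point in the investigation.

```python
import os


def count_bytes_until_pipe_full():
    """Write single bytes to a pipe that nobody reads and count how many fit.

    With a blocking write end (the default), the write that fails here with
    BlockingIOError would instead wait until a reader drains the pipe.
    """
    read_fd, write_fd = os.pipe()
    # Assumption for the demo only: non-blocking mode, so a full pipe raises
    # BlockingIOError rather than suspending the caller.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, b"0")
            except BlockingIOError:
                break  # the kernel buffer is full
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


if __name__ == "__main__":
    print("bytes accepted before the pipe filled:", count_bytes_until_pipe_full())
```

The number printed depends on the platform's pipe capacity. On Linux the default is commonly 64 KiB, which would be in line with the msgOut counter stalling at about 65.5k later in this issue, but that connection is an inference rather than something the trace above proves.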

#3 Updated by bmbouter over 3 years ago

The root cause is identified, and I filed it in the Kombu upstream issue tracker. https://github.com/celery/kombu/issues/577

I'll be fixing it upstream and then we'll cherry pick that commit as a patch to the version of python-kombu that Pulp carries along with the version in Rawhide.

#4 Updated by rbarlow over 3 years ago

On Thursday, March 31, 2016 9:17:01 PM EDT you wrote:

I'll be fixing it upstream and then we'll cherry pick that commit as a patch to the version of python-kombu that Pulp carries along with the version in Rawhide.

Consider trying to get the patch into Fedora 24 as well so we don't have this problem there. Thanks!

#5 Updated by bmbouter over 3 years ago

rbarlow wrote:

On Thursday, March 31, 2016 9:17:01 PM EDT you wrote:

I'll be fixing it upstream and then we'll cherry pick that commit as a patch to the version of python-kombu that Pulp carries along with the version in Rawhide.

Consider trying to get the patch into Fedora 24 as well so we don't have this problem there. Thanks!

Oh yes I will do this. I forgot Fedora 24 had branched. I'll submit the update to both Rawhide and F24.

#6 Updated by mhrivnak over 3 years ago

  • Triaged changed from No to Yes

#7 Updated by bmbouter over 3 years ago

This commit needs to be cherry-picked into the version we carry: https://github.com/celery/kombu/commit/277309f47a713a31885248b78df45e41d8d5e490

This regression was introduced with kombu 3.0.33. This fix needs to be on pulp-dev and newer branches. No existing 2.7 users use 3.0.33 so we can fix it in 2.7-dev and not have to make a new 2.7 release to make the fix available to existing users. The fix will be included with 2.8.2 from the merge forward to master.

#8 Updated by bmbouter over 3 years ago

  • Status changed from ASSIGNED to POST

#9 Updated by dgregor@redhat.com over 3 years ago

  • Version set to 2.8.0

#11 Updated by bmbouter over 3 years ago

  • Private changed from No to Yes

#12 Updated by bmbouter over 3 years ago

  • Private changed from Yes to No

#14 Updated by pthomas@redhat.com over 3 years ago

Before updating kombu:

[root@ibm-x3550m3-12 ~]# rpm -qa |grep kombu
python-kombu-3.0.33-4.pulp.el7.noarch
[root@ibm-x3550m3-12 ~]# 

[root@ibm-x3550m3-12 ~]# sudo qpid-stat  -q |grep  celeryev
  celeryev.223a4cfb-e1bd-4f6e-b146-0198d295e33a                                         Y              20.4k  86.0k  65.5k   18.0m  75.5m    57.6m        1     2
[root@ibm-x3550m3-12 ~]# journalctl -f -l
-- Logs begin at Mon 2016-04-04 21:51:34 CEST. --
Apr 05 13:55:02 ibm-x3550m3-12.lab.eng.brq.redhat.com pulp[32000]: pulp.server.async.scheduler:ERROR: There are 0 pulp_resource_manager processes running. Pulp will not operate correctly without at least one pulp_resource_mananger process running.
Apr 05 13:55:02 ibm-x3550m3-12.lab.eng.brq.redhat.com pulp[32000]: pulp.server.async.scheduler:ERROR: There are 0 pulp_celerybeat processes running. Pulp will not operate correctly without at least one pulp_celerybeat process running.

#15 Updated by bmbouter over 3 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100

#16 Updated by pthomas@redhat.com over 3 years ago

Verified that msgIn and msgOut are the same and that msgOut doesn't stop after 65k:

[root@pulp-el7 ~]# rpm -qa |grep kombu
python-kombu-3.0.33-5.pulp.el7.noarch
[root@pulp-el7 ~]# sudo qpid-stat  -q |grep  celeryev
Queues
  queue                                                                            dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================================================
 celeryev.9631492f-a29e-4bdc-b843-23911d505f2d                                         Y                 0   145k   145k      0    128m     128m        1     2

[root@pulp-el6 ~]# rpm -qa |grep kombu
python-kombu-3.0.33-5.pulp.el6.noarch
[root@pulp-el6 ~]# 

Queues
  queue                                                                            dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =========================================================================================================================================================
 celeryev.0caa15b8-8829-441f-8ed2-231cd34a94dd                                                   Y                 0   156k   156k      0    142m     142m        1     2
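The msgIn/msgOut comparison used for this verification could be scripted roughly as follows. This is a sketch only: the helper names are hypothetical, the column order is taken from the `qpid-stat -q` header shown above, and the `k`/`m`/`g` suffixes are assumed to be decimal multipliers. Fields are read from the right-hand side of each row because the dur/autoDel/excl flag columns may be blank.

```python
def parse_count(text):
    """Convert a qpid-stat count such as '65.5k' or '145k' to an integer.

    Assumption: the suffixes k, m and g mean 10**3, 10**6 and 10**9.
    """
    multipliers = {"k": 1000, "m": 1000000, "g": 1000000000}
    suffix = text[-1].lower()
    if suffix in multipliers:
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)


def parse_queue_line(line):
    """Return (queue name, msg, msgIn, msgOut) for one qpid-stat -q row.

    The last eight columns are msg, msgIn, msgOut, bytes, bytesIn, bytesOut,
    cons and bind, so the three counters are taken relative to the row's end.
    """
    fields = line.split()
    msg, msg_in, msg_out = (parse_count(field) for field in fields[-8:-5])
    return fields[0], msg, msg_in, msg_out


def is_drained(line):
    """True when the queue is empty and every received message was delivered."""
    _, msg, msg_in, msg_out = parse_queue_line(line)
    return msg == 0 and msg_in == msg_out


if __name__ == "__main__":
    before = ("celeryev.223a4cfb-e1bd-4f6e-b146-0198d295e33a Y "
              "20.4k 86.0k 65.5k 18.0m 75.5m 57.6m 1 2")
    after = ("celeryev.9631492f-a29e-4bdc-b843-23911d505f2d Y "
             "0 145k 145k 0 128m 128m 1 2")
    print(is_drained(before))  # False: msgOut stalled behind msgIn
    print(is_drained(after))   # True: msgIn equals msgOut and no backlog
```

Because qpid-stat rounds large counters, equal displayed values only show that the counters agree to the displayed precision.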

#17 Updated by bmbouter over 3 years ago

The patch has been applied in rawhide and is currently available.
I've submitted an update to F24 also here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-ec038bbf19

#18 Updated by semyers over 3 years ago

  • Platform Release changed from 2.8.2 to 2.8.3

#26 Updated by semyers over 3 years ago

  • Status changed from MODIFIED to ON_QA

#28 Updated by pthomas@redhat.com over 3 years ago

  • Status changed from ON_QA to VERIFIED

#29 Updated by semyers over 3 years ago

  • Status changed from VERIFIED to CLOSED - CURRENTRELEASE

#30 Updated by pulpbot almost 3 years ago

  • Verified changed from No to Yes

#31 Updated by bmbouter almost 2 years ago

  • Sprint set to Sprint 1

#32 Updated by bmbouter almost 2 years ago

  • Sprint/Milestone deleted (19)

#33 Updated by bmbouter 8 months ago

  • Tags Pulp 2 added
