Issue #2613

closed

Worker Heartbeats Broken After Broker Reconnect

Added by bmbouter about 7 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
2.13.0
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 17
Quarter:

Description

This issue was introduced by the addition of the Celery bootstep code that lets each worker write its own heartbeat.

https://github.com/pulp/pulp/pull/2922

This is not a Pulp-specific problem. The root cause appears to be that Celery does not properly re-register signal handlers when it rebuilds the worker blueprint after disconnecting from and reconnecting to the broker.

To reproduce in Pulp:

1. prestart and smoke test with zoo repo sync
2. sudo systemctl stop qpidd
3. wait 30 seconds
4. observe connection error messages in the logs
5. sudo systemctl start qpidd
6. observe that all processes except celerybeat go offline and no longer have worker records in the database, even though the connection error messages have stopped
7. prestart
8. sync zoo repo successfully

You can also test this with Pulp Smash by running:

workon pulp-smash
python3 -m unittest pulp_smash.tests.rpm.api_v2.test_broker.BrokerTestCase.test_broker_reconnect

To reproduce this generically, save the following as a Python file and run it with the command `celery worker -A <file_name>.app`

from celery import Celery
from celery import bootsteps

class Reproducer(bootsteps.StartStopStep):
    requires = ('celery.worker.components:Timer', )

    def __init__(self, parent, **kwargs):
        # here we can prepare the Worker/Consumer object
        # in any way we want, set attribute defaults, and so on.
        print('{0!r} is in init'.format(parent))

    def start(self, worker):
        self.timer_ref = worker.timer.call_repeatedly(
            5,
            self.do_work,
            (worker, ),
            priority=10,
        )

    def do_work(self, worker):
        print('{0!r} heartbeat'.format(worker))

    def stop(self, parent):
        print('{0!r} is stopping'.format(parent))

    def shutdown(self, parent):
        print('{0!r} is shutting down'.format(parent))

app = Celery(broker='qpid://')
app.steps['worker'].add(Reproducer)

The same symptoms occur when using the RabbitMQ broker instead of Qpid, but with different error messages.

Actions #1

Updated by bmbouter about 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
  • Sprint/Milestone set to 34

Adding to current sprint since it will affect Fedora26. I'm taking it as assigned.

Actions #2

Updated by bmbouter about 7 years ago

  • Description updated (diff)
Actions #3

Updated by bizhang about 7 years ago

  • Triaged changed from No to Yes
Actions #5

Updated by dalley about 7 years ago

  • Assignee changed from bmbouter to dalley

Taking this over from bmbouter after discussion

Actions #6

Updated by mhrivnak about 7 years ago

  • Sprint/Milestone changed from 34 to 36
Actions #7

Updated by dalley about 7 years ago

  • Subject changed from Celerybeat reconnect support broken for Celery4+Kombu4 to Broker reconnect support broken
  • Description updated (diff)
  • Triaged changed from Yes to No

I'm untriaging this and updating with new information. Should this block 2.13.z?

Actions #8

Updated by dalley about 7 years ago

  • Description updated (diff)
Actions #9

Updated by Ichimonji10 about 7 years ago

This is a regression from 2.12. I'm going to be cautious and mark it as a blocker for 2.13.

Actions #10

Updated by bizhang about 7 years ago

  • Priority changed from Normal to High
  • Triaged changed from No to Yes
Actions #12

Updated by dalley about 7 years ago

  • Subject changed from Broker reconnect support broken to Pulp breaks on broker reconnect
  • Description updated (diff)
Actions #13

Updated by bmbouter about 7 years ago

  • Subject changed from Pulp breaks on broker reconnect to Worker Heartbeats Broken After Broker Reconnect

Added by dalley about 7 years ago

Revision 1cc4fc10 | View on GitHub

Fixes issue w/ worker heartbeats on broker failure

Starts Pulp worker heartbeat reporting when the Celery consumer starts rather than when the Celery worker starts, so that worker heartbeat reporting is resumed after a disconnection and subsequent reconnection from the broker.

closes #2613 https://pulp.plan.io/issues/2613
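
The idea behind this fix can be illustrated with a small toy model. This is only a sketch and not Celery or Pulp code: `Timer`, `HeartbeatStep`, `ToyWorker` and `simulate` are hypothetical names, and the model assumes that a reconnect drops the scheduled timer callbacks and reruns only the consumer-level steps, which matches the symptoms described above.

```python
class Timer:
    """Hypothetical stand-in for a worker timer holding repeating callbacks."""

    def __init__(self):
        self.entries = []

    def call_repeatedly(self, fun):
        self.entries.append(fun)

    def clear(self):
        self.entries.clear()

    def tick(self):
        # Fire every registered callback once.
        for fun in list(self.entries):
            fun()


class HeartbeatStep:
    """Toy bootstep that registers a heartbeat callback when started."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def start(self, timer):
        timer.call_repeatedly(lambda: self.log.append(self.name))


class ToyWorker:
    """Toy worker with worker-level and consumer-level steps."""

    def __init__(self):
        self.timer = Timer()
        self.worker_steps = []
        self.consumer_steps = []

    def start(self):
        # Worker-level steps are started once, at worker startup.
        for step in self.worker_steps:
            step.start(self.timer)
        self._start_consumer()

    def _start_consumer(self):
        for step in self.consumer_steps:
            step.start(self.timer)

    def reconnect(self):
        # Assumption of this model: a broker reconnect drops scheduled
        # callbacks and restarts only the consumer-level steps.
        self.timer.clear()
        self._start_consumer()


def simulate():
    log = []
    worker = ToyWorker()
    worker.worker_steps.append(HeartbeatStep("worker-level", log))
    worker.consumer_steps.append(HeartbeatStep("consumer-level", log))
    worker.start()
    worker.timer.tick()   # both heartbeats fire
    worker.reconnect()
    worker.timer.tick()   # only the consumer-level heartbeat fires
    return log


if __name__ == "__main__":
    print(simulate())
```

In this model the worker-level heartbeat stops after the reconnect while the consumer-level one resumes. In Celery terms, this corresponds to registering the heartbeat bootstep on `app.steps['consumer']` rather than on `app.steps['worker']` as the generic reproducer does.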

Actions #14

Updated by bmbouter about 7 years ago

  • Status changed from ASSIGNED to POST
Actions #15

Updated by dalley about 7 years ago

  • Status changed from POST to MODIFIED
Actions #16

Updated by pcreech about 7 years ago

  • Platform Release set to 2.13.0
Actions #17

Updated by pcreech about 7 years ago

  • Status changed from MODIFIED to 5
Actions #18

Updated by pthomas@redhat.com about 7 years ago

Smash tests have passed for this issue.

Actions #19

Updated by pcreech almost 7 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #20

Updated by bmbouter about 6 years ago

  • Sprint set to Sprint 17
Actions #21

Updated by bmbouter about 6 years ago

  • Sprint/Milestone deleted (36)
Actions #22

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added
