Issue #2613
closed
Worker Heartbeats Broken After Broker Reconnect
Severity: 2. Medium
Platform Release: 2.13.0
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint: Sprint 17
Description
This issue was introduced by the addition of the Celery bootstep code that lets each worker write its own heartbeats.
https://github.com/pulp/pulp/pull/2922
This is not a Pulp-specific problem. The root cause appears to be that Celery does not properly re-register signal handlers when it rebuilds the worker blueprint after disconnecting from and reconnecting to the broker.
To reproduce in Pulp:
1. prestart and smoke test with zoo repo sync
2. sudo systemctl stop qpidd
3. wait 30 seconds
4. observe connection error messages in the logs
5. sudo systemctl start qpidd
6. observe that all processes except celerybeat are seen to go offline and no longer have worker records in the database, even though the connection error messages have now stopped
7. prestart
8. sync zoo repo successfully
You can also test this with Pulp Smash by running:
workon pulp-smash
python3 -m unittest pulp_smash.tests.rpm.api_v2.test_broker.BrokerTestCase.test_broker_reconnect
To reproduce this generically, save the following as a Python file and run it with the command ```celery worker -A <file_name>.app```
from celery import Celery
from celery import bootsteps


class Reproducer(bootsteps.StartStopStep):
    requires = ('celery.worker.components:Timer', )

    def __init__(self, parent, **kwargs):
        # here we can prepare the Worker/Consumer object
        # in any way we want, set attribute defaults, and so on.
        print('{0!r} is in init'.format(parent))

    def start(self, worker):
        self.timer_ref = worker.timer.call_repeatedly(
            5,
            self.do_work,
            (worker, ),
            priority=10,
        )

    def do_work(self, worker):
        print('{0!r} heartbeat'.format(worker))

    def stop(self, parent):
        print('{0!r} is stopping'.format(parent))

    def shutdown(self, parent):
        print('{0!r} is shutting down'.format(parent))


app = Celery(broker='qpid://')
app.steps['worker'].add(Reproducer)
The same symptoms occur when using the RabbitMQ broker instead of Qpid, but with different error messages.
Fixes issue w/ worker heartbeats on broker failure
Starts Pulp worker heartbeat reporting when the Celery consumer starts rather than when the Celery worker starts, so that worker heartbeat reporting is resumed after a disconnection and subsequent reconnection from the broker.
closes #2613 https://pulp.plan.io/issues/2613