Issue #993
Worker discovery takes 30 seconds when all services are restarted at the same time
Status:
closed
Severity:
2. Medium
Version:
Master
Platform Release:
2.7.0
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Easy Fix, Pulp 2
Description
When all of the Pulp services are restarted at the same time, it often takes about 30 seconds before the workers are discovered.
When running the script:
[root@bcourt pulp]# /home/bcourt/bin/restart-pulp.sh
++ systemctl stop httpd
++ systemctl stop pulp_workers
++ systemctl stop pulp_resource_manager
++ systemctl stop pulp_celerybeat
++ systemctl stop goferd
++ systemctl stop qpidd
++ systemctl start qpidd
++ systemctl start goferd
++ systemctl start pulp_celerybeat
++ systemctl start pulp_resource_manager
++ systemctl start pulp_workers
++ systemctl start httpd
[root@bcourt pulp]#
The following are the relevant log entries:
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.qpid.connection:INFO: opened: qpid+tcp://localhost:5672
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.connect:INFO: connected: qpid+tcp://localhost:5672
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bcourt.usersys.redhat.com' discovered
May 20 13:34:11 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bcourt.usersys.redhat.com' discovered
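This looks like a race condition: the worker's initial heartbeat is missed because celerybeat is still starting up, so the workers are not discovered until a later heartbeat arrives (13:34:10 in the log above, 28 seconds after the server initialized at 13:33:42). A simple solution would be to have the workers heartbeat faster initially and then back off once they have established communication with celerybeat.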
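The proposed back-off could look roughly like the sketch below. It is not the Pulp or Celery implementation: `heartbeat_intervals`, the `acknowledged` callable, and the 1-second and 30-second values are all hypothetical names and numbers chosen for illustration. The worker sends heartbeats at a short fixed interval until it knows it has been seen, then doubles the interval up to the normal steady-state value.

```python
import itertools


def heartbeat_intervals(acknowledged, fast=1.0, normal=30.0, factor=2.0):
    """Yield the delay in seconds to wait before each heartbeat.

    `acknowledged` is a hypothetical zero-argument callable that returns
    True once the watcher side has seen this worker. The default interval
    values are assumptions, not Pulp settings.
    """
    delay = fast
    while True:
        if acknowledged():
            # Communication is established: back off toward the normal rate.
            delay = min(delay * factor, normal)
        else:
            # Not yet discovered: keep heartbeating quickly.
            delay = fast
        yield delay


# Usage: simulate a worker that is first acknowledged on its third check.
acks = iter([False, False] + [True] * 6)
delays = list(itertools.islice(heartbeat_intervals(lambda: next(acks)), 8))
print(delays)  # [1.0, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With this shape, a missed first heartbeat costs about one `fast` interval rather than a full `normal` interval, while steady-state message traffic is unchanged.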
Fix delayed worker discovery if all the services are restarted at the same time.
fixes #993