Issue #993
closedWorker discovery takes 30 seconds when all services restarted at the same time
Description
When all of the pulp services are restarted at the same time it often takes 30 seconds before the workers are discovered.
When running the script:
[root@bcourt pulp]# /home/bcourt/bin/restart-pulp.sh
++ systemctl stop httpd
++ systemctl stop pulp_workers
++ systemctl stop pulp_resource_manager
++ systemctl stop pulp_celerybeat
++ systemctl stop goferd
++ systemctl stop qpidd
++ systemctl start qpidd
++ systemctl start goferd
++ systemctl start pulp_celerybeat
++ systemctl start pulp_resource_manager
++ systemctl start pulp_workers
++ systemctl start httpd
[root@bcourt pulp]#
The following are the relevant log entries:
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.qpid.connection:INFO: opened: qpid+tcp://localhost:5672
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.connect:INFO: connected: qpid+tcp://localhost:5672
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bcourt.usersys.redhat.com' discovered
May 20 13:34:11 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bcourt.usersys.redhat.com' discovered
This looks like a race condition where the initial heartbeat from the worker is missed as celerybeat is still starting up. A simple solution would be to have the workers heartbeat faster initially and then back off once they have communication with celerybeat.
Updated by bcourt over 9 years ago
- Subject changed from Worker discovery takes 30 seconds when all services restarted to Worker discovery takes 30 seconds when all services restarted at the same time
Updated by bmbouter over 9 years ago
Here is how I think this should be fixed:
When workers startup they emit a worker-online event. Add this event type to the events the scheduler monitors. The handler of the worker-online event should craft a heartbeat and pass it to worker_watcher.handle_worker_heartbeat similar to the way this code does it.
Updated by mhrivnak over 9 years ago
- Platform Release set to 2.7.0
- Triaged changed from No to Yes
- Tags Easy Fix added
This was introduced in 2.7.0 alpha.
Updated by bcourt over 9 years ago
- Status changed from NEW to POST
Added by bcourt over 9 years ago
Added by bcourt over 9 years ago
Revision 2e250c92 | View on GitHub
Fix delayed worker discovery if all the services are restarted at the same time.
fixes #993
Updated by bcourt over 9 years ago
- % Done changed from 0 to 100
Applied in changeset pulp|2e250c92d2bf58a1759cf7931af3976b7cef6e28.
Updated by dkliban@redhat.com over 9 years ago
- Status changed from MODIFIED to 5
Updated by Skullman over 9 years ago
- Status changed from 5 to 6
Verified with pulp-server-2.7.0-0.3.beta.fc21.noarch
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@nec-em24-1.rhts.eng.bos.redhat.com' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-13@nec-em24-1.rhts.eng.bos.redhat.com' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-6@nec-em24-1.rhts.eng.bos.redhat.com' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-8@nec-em24-1.rhts.eng.bos.redhat.com' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-14@nec-em24-1.rhts.eng.bos.redhat.com' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13229]: pulp.server.db.connection:INFO: Mongo database for connection is version 2.4.13
[...]
--snip--
[...]
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: *******************************************************
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: *******************************************************
Updated by amacdona@redhat.com about 9 years ago
- Status changed from 6 to CLOSED - CURRENTRELEASE
Fix delayed worker discovery if all the services are restarted at the same time.
fixes #993