Project

Profile

Help

Issue #993

Worker discovery takes 30 seconds when all services restarted at the same time

Added by bcourt about 5 years ago. Updated over 1 year ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Master
Platform Release:
2.7.0
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Easy Fix, Pulp 2
Sprint:

Description

When all of the pulp services are restarted at the same time it often takes 30 seconds before the workers are discovered.

When running the script:

[root@bcourt pulp]# /home/bcourt/bin/restart-pulp.sh 
++ systemctl stop httpd
++ systemctl stop pulp_workers
++ systemctl stop pulp_resource_manager
++ systemctl stop pulp_celerybeat
++ systemctl stop goferd
++ systemctl stop qpidd
++ systemctl start qpidd
++ systemctl start goferd
++ systemctl start pulp_celerybeat
++ systemctl start pulp_resource_manager
++ systemctl start pulp_workers
++ systemctl start httpd
[root@bcourt pulp]# 

The following are the relevant log entries:

May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.qpid.connection:INFO: opened: qpid+tcp://localhost:5672
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.connect:INFO: connected: qpid+tcp://localhost:5672
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bcourt.usersys.redhat.com' discovered
May 20 13:34:11 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bcourt.usersys.redhat.com' discovered

This looks like a race condition where the initial heartbeat from the worker is missed as celerybeat is still starting up. A simple solution would be to have the workers heartbeat faster initially and then back off once they have communication with celerybeat.

Associated revisions

Revision 2e250c92 View on GitHub
Added by bcourt about 5 years ago

Fix delayed worker discovery if all the services are restarted at the same time.

fixes #993

Revision 2e250c92 View on GitHub
Added by bcourt about 5 years ago

Fix delayed worker discovery if all the services are restarted at the same time.

fixes #993

History

#1 Updated by bcourt about 5 years ago

  • Subject changed from Worker discovery takes 30 seconds when all services restarted to Worker discovery takes 30 seconds when all services restarted at the same time

#2 Updated by bmbouter about 5 years ago

Here is how I think this should be fixed:

When workers startup they emit a worker-online event. Add this event type to the events the scheduler monitors. The handler of the worker-online event should craft a heartbeat and pass it to worker_watcher.handle_worker_heartbeat similar to the way this code does it.

#3 Updated by mhrivnak about 5 years ago

  • Platform Release set to 2.7.0
  • Triaged changed from No to Yes
  • Tags Easy Fix added

This was introduced in 2.7.0 alpha.

#4 Updated by bcourt about 5 years ago

  • Assignee set to bcourt

#5 Updated by bcourt about 5 years ago

  • Status changed from NEW to POST

#6 Updated by bcourt about 5 years ago

  • Status changed from POST to MODIFIED

#7 Updated by bcourt about 5 years ago

  • % Done changed from 0 to 100

#8 Updated by dkliban@redhat.com about 5 years ago

  • Status changed from MODIFIED to 5

#9 Updated by Skullman about 5 years ago

  • Status changed from 5 to 6

Verified with pulp-server-2.7.0-0.3.beta.fc21.noarch

Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13229]: pulp.server.db.connection:INFO: Mongo database for connection is version 2.4.13
[...]
--snip--
[...]
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: *******************************************************
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: *******************************************************

#10 Updated by amacdona@redhat.com over 4 years ago

  • Status changed from 6 to CLOSED - CURRENTRELEASE

#12 Updated by bmbouter over 1 year ago

  • Tags Pulp 2 added

Please register to edit this issue

Also available in: Atom PDF