Project

Profile

Help

Issue #993

Worker discovery takes 30 seconds when all services restarted at the same time

Added by bcourt over 4 years ago. Updated 9 months ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Severity:
2. Medium
Version:
Master
Platform Release:
2.7.0
Blocks Release:
OS:
Backwards Incompatible:
No
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Easy Fix, Pulp 2
QA Contact:
Complexity:
Smash Test:
Verified:
Yes
Verification Required:
No
Sprint:

Description

When all of the pulp services are restarted at the same time it often takes 30 seconds before the workers are discovered.

When running the script:

[root@bcourt pulp]# /home/bcourt/bin/restart-pulp.sh 
++ systemctl stop httpd
++ systemctl stop pulp_workers
++ systemctl stop pulp_resource_manager
++ systemctl stop pulp_celerybeat
++ systemctl stop goferd
++ systemctl stop qpidd
++ systemctl start qpidd
++ systemctl start goferd
++ systemctl start pulp_celerybeat
++ systemctl start pulp_resource_manager
++ systemctl start pulp_workers
++ systemctl start httpd
[root@bcourt pulp]# 

The following are the relevant log entries:

May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.qpid.connection:INFO: opened: qpid+tcp://localhost:5672
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.connect:INFO: connected: qpid+tcp://localhost:5672
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bcourt.usersys.redhat.com' discovered
May 20 13:34:11 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bcourt.usersys.redhat.com' discovered

This looks like a race condition where the initial heartbeat from the worker is missed as celerybeat is still starting up. A simple solution would be to have the workers heartbeat faster initially and then back off once they have communication with celerybeat.

Associated revisions

Revision 2e250c92 View on GitHub
Added by bcourt over 4 years ago

Fix delayed worker discovery if all the services are restarted at the same time.

fixes #993

Revision 2e250c92 View on GitHub
Added by bcourt over 4 years ago

Fix delayed worker discovery if all the services are restarted at the same time.

fixes #993

History

#1 Updated by bcourt over 4 years ago

  • Subject changed from Worker discovery takes 30 seconds when all services restarted to Worker discovery takes 30 seconds when all services restarted at the same time

#2 Updated by bmbouter over 4 years ago

Here is how I think this should be fixed:

When workers startup they emit a worker-online event. Add this event type to the events the scheduler monitors. The handler of the worker-online event should craft a heartbeat and pass it to worker_watcher.handle_worker_heartbeat similar to the way this code does it.

#3 Updated by mhrivnak over 4 years ago

  • Platform Release set to 2.7.0
  • Triaged changed from No to Yes
  • Tags Easy Fix added

This was introduced in 2.7.0 alpha.

#4 Updated by bcourt over 4 years ago

  • Assignee set to bcourt

#5 Updated by bcourt over 4 years ago

  • Status changed from NEW to POST

#6 Updated by bcourt over 4 years ago

  • Status changed from POST to MODIFIED

#7 Updated by bcourt over 4 years ago

  • % Done changed from 0 to 100

#8 Updated by dkliban@redhat.com over 4 years ago

  • Status changed from MODIFIED to ON_QA

#9 Updated by Skullman over 4 years ago

  • Status changed from ON_QA to VERIFIED
  • QA Contact set to Skullman

Verified with pulp-server-2.7.0-0.3.beta.fc21.noarch

Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp13068: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp13068: pulp.server.async.worker_watcher:INFO: New worker '' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp13068: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp13068: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp13068: pulp.server.async.worker_watcher:INFO: New worker '' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp13229: pulp.server.db.connection:INFO: Mongo database for connection is version 2.4.13
[...]
--snip--
[...]
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp13230: pulp.server.webservices.application:INFO: *****************************************************
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp13230: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp13230: pulp.server.webservices.application:INFO: *****************************************************

#10 Updated by amacdona@redhat.com about 4 years ago

  • Status changed from VERIFIED to CLOSED - CURRENTRELEASE

#11 Updated by pulpbot almost 3 years ago

  • Verified changed from No to Yes

#12 Updated by bmbouter 9 months ago

  • Tags Pulp 2 added

Please register to edit this issue

Also available in: Atom PDF