Project

Profile

Help

Issue #993

closed

Worker discovery takes 30 seconds when all services restarted at the same time

Added by bcourt almost 9 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Master
Platform Release:
2.7.0
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Easy Fix, Pulp 2
Sprint:
Quarter:

Description

When all of the pulp services are restarted at the same time it often takes 30 seconds before the workers are discovered.

When running the script:

[root@bcourt pulp]# /home/bcourt/bin/restart-pulp.sh 
++ systemctl stop httpd
++ systemctl stop pulp_workers
++ systemctl stop pulp_resource_manager
++ systemctl stop pulp_celerybeat
++ systemctl stop goferd
++ systemctl stop qpidd
++ systemctl start qpidd
++ systemctl start goferd
++ systemctl start pulp_celerybeat
++ systemctl start pulp_resource_manager
++ systemctl start pulp_workers
++ systemctl start httpd
[root@bcourt pulp]# 

The following are the relevant log entries:

May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.qpid.connection:INFO: opened: qpid+tcp://localhost:5672
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.connect:INFO: connected: qpid+tcp://localhost:5672
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bcourt.usersys.redhat.com' discovered
May 20 13:34:11 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bcourt.usersys.redhat.com' discovered

This looks like a race condition where the initial heartbeat from the worker is missed as celerybeat is still starting up. A simple solution would be to have the workers heartbeat faster initially and then back off once they have communication with celerybeat.

Actions #1

Updated by bcourt almost 9 years ago

  • Subject changed from Worker discovery takes 30 seconds when all services restarted to Worker discovery takes 30 seconds when all services restarted at the same time
Actions #2

Updated by bmbouter almost 9 years ago

Here is how I think this should be fixed:

When workers startup they emit a worker-online event. Add this event type to the events the scheduler monitors. The handler of the worker-online event should craft a heartbeat and pass it to worker_watcher.handle_worker_heartbeat similar to the way this code does it.

Actions #3

Updated by mhrivnak almost 9 years ago

  • Platform Release set to 2.7.0
  • Triaged changed from No to Yes
  • Tags Easy Fix added

This was introduced in 2.7.0 alpha.

Actions #4

Updated by bcourt almost 9 years ago

  • Assignee set to bcourt
Actions #5

Updated by bcourt almost 9 years ago

  • Status changed from NEW to POST

Added by bcourt almost 9 years ago

Revision 2e250c92 | View on GitHub

Fix delayed worker discovery if all the services are restarted at the same time.

fixes #993

Added by bcourt almost 9 years ago

Revision 2e250c92 | View on GitHub

Fix delayed worker discovery if all the services are restarted at the same time.

fixes #993

Actions #6

Updated by bcourt almost 9 years ago

  • Status changed from POST to MODIFIED
Actions #7

Updated by bcourt almost 9 years ago

  • % Done changed from 0 to 100
Actions #8

Updated by dkliban@redhat.com almost 9 years ago

  • Status changed from MODIFIED to 5
Actions #9

Updated by Skullman almost 9 years ago

  • Status changed from 5 to 6

Verified with pulp-server-2.7.0-0.3.beta.fc21.noarch

Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker '' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13229]: pulp.server.db.connection:INFO: Mongo database for connection is version 2.4.13
[...]
--snip--
[...]
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: *******************************************************
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: *******************************************************

Actions #10

Updated by amacdona@redhat.com over 8 years ago

  • Status changed from 6 to CLOSED - CURRENTRELEASE
Actions #12

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added

Also available in: Atom PDF