Issue #993: Worker discovery takes 30 seconds when all services restarted at the same time - Pulp

Actions

Send by e-mail Copy link

Issue #993

closed

Worker discovery takes 30 seconds when all services restarted at the same time

Added by bcourt over 9 years ago. Updated over 5 years ago.

Status:

CLOSED - CURRENTRELEASE

Priority:

Normal

Assignee:

bcourt

Category:

Sprint/Milestone:

Start date:

Due date:

Estimated time:

Severity:

2. Medium

Version:

Master

Platform Release:

2.7.0

OS:

Triaged:

Yes

Groomed:

Sprint Candidate:

Tags:

Easy Fix, Pulp 2

Sprint:

Quarter:

Description

When all of the pulp services are restarted at the same time it often takes 30 seconds before the workers are discovered.

When running the script:

[root@bcourt pulp]# /home/bcourt/bin/restart-pulp.sh 
++ systemctl stop httpd
++ systemctl stop pulp_workers
++ systemctl stop pulp_resource_manager
++ systemctl stop pulp_celerybeat
++ systemctl stop goferd
++ systemctl stop qpidd
++ systemctl start qpidd
++ systemctl start goferd
++ systemctl start pulp_celerybeat
++ systemctl start pulp_resource_manager
++ systemctl start pulp_workers
++ systemctl start httpd
[root@bcourt pulp]#

The following are the relevant log entries:

May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.qpid.connection:INFO: opened: qpid+tcp://localhost:5672
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.connect:INFO: connected: qpid+tcp://localhost:5672
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bcourt.usersys.redhat.com' discovered
May 20 13:34:11 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bcourt.usersys.redhat.com' discovered

This looks like a race condition where the initial heartbeat from the worker is missed as celerybeat is still starting up. A simple solution would be to have the workers heartbeat faster initially and then back off once they have communication with celerybeat.

Actions

Copy link

Updated by bcourt over 9 years ago

Subject changed from Worker discovery takes 30 seconds when all services restarted to Worker discovery takes 30 seconds when all services restarted at the same time

Actions

Copy link

Updated by bmbouter over 9 years ago

Here is how I think this should be fixed:

When workers startup they emit a worker-online event. Add this event type to the events the scheduler monitors. The handler of the worker-online event should craft a heartbeat and pass it to worker_watcher.handle_worker_heartbeat similar to the way this code does it.

Actions

Copy link

Updated by mhrivnak over 9 years ago

Platform Release set to 2.7.0
Triaged changed from No to Yes
Tags Easy Fix added

This was introduced in 2.7.0 alpha.

Actions

Copy link

Updated by bcourt over 9 years ago

Assignee set to bcourt

Actions

Copy link

Updated by bcourt over 9 years ago

Status changed from NEW to POST

https://github.com/pulp/pulp/pull/1889

Added by bcourt over 9 years ago

Revision 2e250c92 | View on GitHub

Fix delayed worker discovery if all the services are restarted at the same time.

fixes #993

Added by bcourt over 9 years ago

Revision 2e250c92 | View on GitHub

Fix delayed worker discovery if all the services are restarted at the same time.

fixes #993

Actions

Copy link

Updated by bcourt over 9 years ago

Status changed from POST to MODIFIED

Actions

Copy link

Updated by bcourt over 9 years ago

% Done changed from 0 to 100

Applied in changeset pulp|2e250c92d2bf58a1759cf7931af3976b7cef6e28.

Actions

Copy link

Updated by dkliban@redhat.com over 9 years ago

Status changed from MODIFIED to 5

Actions

Copy link

Updated by Skullman over 9 years ago

Status changed from 5 to 6

Verified with pulp-server-2.7.0-0.3.beta.fc21.noarch

Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@nec-em24-1.rhts.eng.bos.redhat.com' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-13@nec-em24-1.rhts.eng.bos.redhat.com' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-6@nec-em24-1.rhts.eng.bos.redhat.com' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-8@nec-em24-1.rhts.eng.bos.redhat.com' discovered
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13068]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-14@nec-em24-1.rhts.eng.bos.redhat.com' discovere
Jun 29 12:04:50 nec-em24-1.rhts.eng.bos.redhat.com pulp[13229]: pulp.server.db.connection:INFO: Mongo database for connection is version 2.4.13
[...]
--snip--
[...]
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: *******************************************************
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
Jun 29 12:04:51 nec-em24-1.rhts.eng.bos.redhat.com pulp[13230]: pulp.server.webservices.application:INFO: *******************************************************

Actions

Copy link

#10

Updated by amacdona@redhat.com about 9 years ago

Status changed from 6 to CLOSED - CURRENTRELEASE

Actions

Copy link

#12

Updated by bmbouter over 5 years ago

Tags Pulp 2 added

Actions

Send by e-mail Copy link

Also available in: Atom PDF

Project

Profile

Help

Pulp

Agile boards

Custom queries

Issue #993

Worker discovery takes 30 seconds when all services restarted at the same time

Updated by bcourt over 9 years ago

Updated by bmbouter over 9 years ago

Updated by mhrivnak over 9 years ago

Updated by bcourt over 9 years ago

Updated by bcourt over 9 years ago

Added by bcourt over 9 years ago

Added by bcourt over 9 years ago

Updated by bcourt over 9 years ago

Updated by bcourt over 9 years ago

Updated by dkliban@redhat.com over 9 years ago

Updated by Skullman over 9 years ago

Updated by amacdona@redhat.com about 9 years ago

Updated by bmbouter over 5 years ago