Issue #993: Worker discovery takes 30 seconds when all services restarted at the same time - Pulp

Added by bcourt over 9 years ago. Updated over 5 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee: bcourt
Category:
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version: Master
Platform Release: 2.7.0
OS:
Triaged: Yes
Groomed:
Sprint Candidate:
Tags: Easy Fix, Pulp 2
Sprint:
Quarter:

Description

When all of the Pulp services are restarted at the same time, it often takes about 30 seconds before the workers are discovered.

When running the script:

[root@bcourt pulp]# /home/bcourt/bin/restart-pulp.sh 
++ systemctl stop httpd
++ systemctl stop pulp_workers
++ systemctl stop pulp_resource_manager
++ systemctl stop pulp_celerybeat
++ systemctl stop goferd
++ systemctl stop qpidd
++ systemctl start qpidd
++ systemctl start goferd
++ systemctl start pulp_celerybeat
++ systemctl start pulp_resource_manager
++ systemctl start pulp_workers
++ systemctl start httpd
[root@bcourt pulp]#

The following are the relevant log entries:

May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.qpid.connection:INFO: opened: qpid+tcp://localhost:5672
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.connect:INFO: connected: qpid+tcp://localhost:5672
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bcourt.usersys.redhat.com' discovered
May 20 13:34:11 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bcourt.usersys.redhat.com' discovered
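
The size of the delay can be read off the log excerpt above. A minimal sketch of the arithmetic, using the timestamp of the "successfully initialized" message and of the first "New worker ... discovered" message; the helper name `seconds_between` is only illustrative and not part of Pulp:

```python
from datetime import datetime

# Timestamps copied from the log excerpt: server initialized, first worker discovered.
SERVER_READY = "13:33:42"
FIRST_WORKER_SEEN = "13:34:10"


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Return the number of seconds between two same-day time-of-day strings."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


delay = seconds_between(SERVER_READY, FIRST_WORKER_SEEN)
print(delay)  # 28.0
```

In this run the gap is 28 seconds, in line with the roughly 30 seconds reported in the title.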

This looks like a race condition: the initial heartbeat from each worker is missed because celerybeat is still starting up, so the workers are presumably not discovered until a later heartbeat arrives. A simple solution would be to have the workers heartbeat faster initially and then back off once they have established communication with celerybeat.
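
A rough sketch of the proposed "heartbeat fast, then back off" idea follows. It is not Pulp or Celery code: the class `HeartbeatSchedule`, its methods, the 1-second starting interval, the doubling factor, and the 30-second steady-state interval are all assumptions chosen for illustration.

```python
class HeartbeatSchedule:
    """Hypothetical helper: short heartbeat intervals until acknowledged, then back off."""

    def __init__(self, fast_interval=1.0, normal_interval=30.0, backoff_factor=2.0):
        self.normal_interval = normal_interval  # assumed steady-state interval (seconds)
        self.backoff_factor = backoff_factor    # assumed growth factor after acknowledgement
        self._current = fast_interval           # assumed interval used during startup
        self._acknowledged = False

    def acknowledge(self):
        """Record that the other side (e.g. celerybeat) has seen a heartbeat."""
        self._acknowledged = True

    def next_interval(self):
        """Return the seconds to wait before sending the next heartbeat."""
        interval = self._current
        if self._acknowledged:
            # Once communication is established, grow toward the normal interval.
            self._current = min(self._current * self.backoff_factor, self.normal_interval)
        return interval


schedule = HeartbeatSchedule()
startup = [schedule.next_interval() for _ in range(3)]
schedule.acknowledge()
steady = [schedule.next_interval() for _ in range(7)]
print(startup)  # [1.0, 1.0, 1.0]
print(steady)   # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With a schedule like this, a heartbeat missed while celerybeat is starting would be followed by another within about a second rather than a full interval later. How a worker would actually learn that celerybeat has seen it is left open here.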


Updated by bcourt over 9 years ago

Updated by bmbouter over 9 years ago

Updated by mhrivnak over 9 years ago

Updated by bcourt over 9 years ago

Updated by bcourt over 9 years ago

Added by bcourt over 9 years ago

Added by bcourt over 9 years ago

Updated by bcourt over 9 years ago

Updated by bcourt over 9 years ago

Updated by dkliban@redhat.com over 9 years ago

Updated by Skullman over 9 years ago

Updated by amacdona@redhat.com about 9 years ago

Updated by bmbouter over 5 years ago