Project

Profile

Help

Issue #993

closed

Worker discovery takes 30 seconds when all services restarted at the same time

Added by bcourt almost 9 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Master
Platform Release:
2.7.0
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Easy Fix, Pulp 2
Sprint:
Quarter:

Description

When all of the pulp services are restarted at the same time it often takes 30 seconds before the workers are discovered.

When running the script:

[root@bcourt pulp]# /home/bcourt/bin/restart-pulp.sh 
++ systemctl stop httpd
++ systemctl stop pulp_workers
++ systemctl stop pulp_resource_manager
++ systemctl stop pulp_celerybeat
++ systemctl stop goferd
++ systemctl stop qpidd
++ systemctl start qpidd
++ systemctl start goferd
++ systemctl start pulp_celerybeat
++ systemctl start pulp_resource_manager
++ systemctl start pulp_workers
++ systemctl start httpd
[root@bcourt pulp]# 

The following are the relevant log entries:

May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: The Pulp server has been successfully initialized
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: pulp.server.webservices.application:INFO: *************************************************************
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.qpid.connection:INFO: opened: qpid+tcp://localhost:5672
May 20 13:33:42 bcourt.usersys.redhat.com pulp[8229]: gofer.messaging.adapter.connect:INFO: connected: qpid+tcp://localhost:5672
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-1@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-3@bcourt.usersys.redhat.com' discovered
May 20 13:34:10 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-0@bcourt.usersys.redhat.com' discovered
May 20 13:34:11 bcourt.usersys.redhat.com pulp[8185]: pulp.server.async.worker_watcher:INFO: New worker 'reserved_resource_worker-2@bcourt.usersys.redhat.com' discovered

This looks like a race condition where the initial heartbeat from the worker is missed as celerybeat is still starting up. A simple solution would be to have the workers heartbeat faster initially and then back off once they have communication with celerybeat.

Also available in: Atom PDF