Story #2509

closed

Pulp process failure detection and any failover should occur within 30 seconds

Added by bizhang almost 8 years ago. Updated over 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

100%

Estimated time:
Platform Release:
2.12.0
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Pulp 2
Sprint:
Sprint 13
Quarter:

Description

Problem

Pulp's failure detection and failover currently take a long time. This involves two primary areas: (1) marking workers as dead and (2) pulp_celerybeat and resource_manager hot-spare failover.

The current timings are:

  • worker heartbeat is 30 seconds
  • celerybeat heartbeat is 90 seconds
  • worker ageout time is 300 seconds
  • celerybeat lock ageout time is 200 seconds
  • resource manager lock check is 60 seconds

This means that when a worker process dies, it could take 300-390 seconds for it to be considered dead. It also means that it takes 200-290 seconds for celerybeat to fail over and 300-360 seconds for the resource_manager to fail over.
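The ranges above can be reproduced with a little arithmetic. The sketch below assumes each window runs from the relevant ageout time up to that ageout plus one check interval, which is how the numbers in this story appear to be derived. The function name `detection_window` is hypothetical and is not part of Pulp.

```python
def detection_window(ageout_s, check_interval_s):
    """Return (best_case, worst_case) seconds before a dead process is noticed.

    Assumption: a process is noticed on the first periodic check that runs
    after the ageout has elapsed, so the worst case adds one full interval.
    """
    return ageout_s, ageout_s + check_interval_s


# Current timings taken from the list in this story.
worker = detection_window(ageout_s=300, check_interval_s=90)           # worker ageout + celerybeat heartbeat
celerybeat = detection_window(ageout_s=200, check_interval_s=90)       # lock ageout + celerybeat heartbeat
resource_manager = detection_window(ageout_s=300, check_interval_s=60) # worker ageout + lock check

print(worker, celerybeat, resource_manager)
# (300, 390) (200, 290) (300, 360)
```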

Solution

This time should be shorter. We should declare Pulp workers dead if they have been missing for 30 seconds. Both pulp_celerybeat and pulp_resource_manager should also fail over within 30 seconds.

We can do this by updating the worker heartbeat to 5 seconds, the celerybeat heartbeat to 5 seconds, and the worker ageout time to 25 seconds.

This would mean that a worker that has not checked in for 5 heartbeats (25s) would be considered missing the next time celerybeat checks (25s-30s after the worker last checked in).
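As a rough illustration of the 25s-30s window, the sketch below simulates celerybeat ticks every 5 seconds and reports how long after a worker's last check-in the first tick sees it as aged out. The function `first_detection` and its parameters are hypothetical, and the `>=` comparison against the ageout is an assumption rather than Pulp's actual check.

```python
def first_detection(last_checkin_s, beat_offset_s, beat_interval_s=5, ageout_s=25):
    """Seconds after the last check-in at which a celerybeat tick first
    finds the worker aged out.

    beat_offset_s is the time of the first celerybeat tick; it models the
    fact that celerybeat ticks are not aligned with worker check-ins.
    """
    tick = beat_offset_s
    # Advance tick by tick until the worker's silence reaches the ageout.
    while tick - last_checkin_s < ageout_s:
        tick += beat_interval_s
    return tick - last_checkin_s


# Worker last checked in at t=0; try every whole-second tick alignment.
delays = [first_detection(0, offset) for offset in range(5)]
print(delays)
# [25, 26, 27, 28, 29]
```

Whatever the alignment between the two heartbeats, detection in this model lands at or after the 25s ageout and before 30s.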

In addition, we need to update the resource manager lock failover logic to match celerybeat's in order to ensure a 30s failover (see comment 5 for details).

The proposed timings are:

  • worker heartbeat 5s
  • celerybeat heartbeat 5s
  • worker ageout time 25s
  • celerybeat lock ageout time 25s
  • resource manager lock heartbeat 5s