Story #2509: Pulp process failure detection and any failover should occur within 30 seconds - Pulp

closed

Pulp process failure detection and any failover should occur within 30 seconds

Added by bizhang over 7 years ago. Updated about 5 years ago.

Status:

CLOSED - CURRENTRELEASE

Priority:

Normal

Assignee:

dalley

Category:

Sprint/Milestone:

Start date:

Due date:

% Done:

100%

Estimated time:

Platform Release:

2.12.0

Groomed:

Yes

Sprint Candidate:

Yes

Tags:

Pulp 2

Sprint:

Sprint 13

Quarter:

Description

Problem

Pulp's failure detection and failover currently take a long time. There are two primary areas: (1) marking workers as dead and (2) pulp_celerybeat and pulp_resource_manager hot-spare failover.

The current timings are:

worker heartbeat is 30 seconds
celerybeat heartbeat is 90 seconds
worker ageout time is 300 seconds
celerybeat lock ageout time is 200 seconds
resource manager lock check is 60 seconds

This means that when a worker process dies, it can take 300-390 seconds for it to be considered dead. It also means that it takes 200-290 seconds for celerybeat to fail over and 300-360 seconds for the resource_manager to fail over.
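The ranges above can be reproduced with a little arithmetic. The sketch below is illustrative only and is not Pulp code: the function name `detection_window` and the constant names are hypothetical, and it assumes each window runs from the relevant ageout time up to the ageout time plus one check interval. Pairing the resource manager with the worker ageout is an assumption inferred from the 300-360 second figure.

```python
def detection_window(ageout_s, check_interval_s):
    """Return (best, worst) seconds between a process's last heartbeat
    and the moment it is noticed as dead.

    Best case: the checker runs exactly when the ageout expires.
    Worst case: the checker ran just before expiry, so a full check
    interval passes before the next look.
    """
    return ageout_s, ageout_s + check_interval_s


# Current timings from the list above, in seconds.
WORKER_AGEOUT = 300
CELERYBEAT_HEARTBEAT = 90
CELERYBEAT_LOCK_AGEOUT = 200
RESOURCE_MANAGER_LOCK_CHECK = 60

worker_window = detection_window(WORKER_AGEOUT, CELERYBEAT_HEARTBEAT)
celerybeat_window = detection_window(CELERYBEAT_LOCK_AGEOUT, CELERYBEAT_HEARTBEAT)
# Assumption: the resource manager lock depends on the worker ageout.
resource_manager_window = detection_window(WORKER_AGEOUT, RESOURCE_MANAGER_LOCK_CHECK)

print(worker_window, celerybeat_window, resource_manager_window)
```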

Solution

This time should be shorter. We should declare Pulp workers dead if they have been missing for 30 seconds. Both pulp_celerybeat and pulp_resource_manager should also fail over within 30 seconds.

We can do this by updating the worker heartbeat to 5 seconds, the celerybeat heartbeat to 5 seconds and the worker ageout time to 25 seconds.

This would mean that a worker that has not checked in for 5 heartbeats (25s) would be considered missing the next time celerybeat checks, which is 25-30s after the worker last checked in.

In addition, we need to update the resource manager lock failover logic to match celerybeat's in order to ensure a 30s failover (see comment 5 for details).

The proposed timings are:

worker heartbeat 5s
celerybeat heartbeat 5s
worker ageout time 25s
celerybeat lock ageout time 25s
resource manager lock heartbeat 5s
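
As a rough check that the proposed timings meet the 30-second target, the ageout test can be sketched as follows. This is a minimal illustration, not the Pulp implementation: `worker_is_missing` and the constant names are hypothetical, and the worst case is assumed to be the worker ageout plus one celerybeat heartbeat.

```python
from datetime import datetime, timedelta

# Proposed timings from the list above.
WORKER_HEARTBEAT = timedelta(seconds=5)
CELERYBEAT_HEARTBEAT = timedelta(seconds=5)
WORKER_AGEOUT = timedelta(seconds=25)
FAILOVER_TARGET = timedelta(seconds=30)


def worker_is_missing(last_heartbeat, now, ageout=WORKER_AGEOUT):
    """Return True once a worker has been silent for the full ageout period."""
    return now - last_heartbeat >= ageout


# Worst case: celerybeat checked just before the ageout expired, so the
# worker is only noticed on the following celerybeat heartbeat.
worst_case_detection = WORKER_AGEOUT + CELERYBEAT_HEARTBEAT

last_seen = datetime(2017, 1, 1, 12, 0, 0)
print(worker_is_missing(last_seen, last_seen + timedelta(seconds=24)))
print(worker_is_missing(last_seen, last_seen + timedelta(seconds=25)))
print(worst_case_detection <= FAILOVER_TARGET)
```

With these values a worker is flagged after five missed 5-second heartbeats, and the worst case lands exactly on the 30-second target, so there is no slack for a late celerybeat check.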
