Story #2509

closed

Pulp process failure detection and any failover should occur within 30 seconds

Added by bizhang over 7 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

100%

Estimated time:
Platform Release:
2.12.0
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Pulp 2
Sprint:
Sprint 13
Quarter:

Description

Problem

Pulp's failure detection and failover currently take a long time. This covers two primary areas: (1) marking workers as dead and (2) pulp_celerybeat and resource_manager hot-spare failover.

The current timings are:

  • worker heartbeat is 30 seconds
  • celerybeat heartbeat is 90 seconds
  • worker ageout time is 300 seconds
  • celerybeat lock ageout time is 200 seconds
  • resource manager lock check is 60 seconds

This means that when a worker process dies, it can take 300 - 390 seconds for it to be considered dead. It also means that celerybeat takes 200 - 290 seconds to fail over and the resource_manager takes 300 - 360 seconds to fail over.
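
The ranges above are consistent with a simple model: a dead process is noticed no earlier than its ageout time and no later than the ageout time plus one check interval. A minimal sketch of that arithmetic follows; the function name and the model itself are illustrative assumptions, not Pulp code.

```python
def detection_window(ageout_s, check_interval_s):
    """Return (earliest, latest) seconds after the last heartbeat
    at which a dead process is noticed, under the assumed model."""
    return ageout_s, ageout_s + check_interval_s


# Current timings taken from the list above.
current = {
    # worker ageout 300s, checked on the 90s celerybeat heartbeat
    "worker": detection_window(300, 90),
    # celerybeat lock ageout 200s, checked on the 90s celerybeat heartbeat
    "celerybeat": detection_window(200, 90),
    # worker ageout 300s, then the 60s resource manager lock check
    "resource_manager": detection_window(300, 60),
}

for name, (low, high) in current.items():
    print(f"{name}: {low} - {high}s")
```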

Solution

This time should be shorter. We should declare Pulp workers to be dead if they have been missing for 30 seconds. We should also have both the pulp_celerybeat and pulp_resource_manager failover within 30 seconds.

We can do this by updating the worker heartbeat to 5 seconds, the celerybeat heartbeat to 5 seconds and the worker ageout time to 25 seconds.

This would mean that a worker that has not checked in for 5 heartbeats (25s) would be considered missing the next time celerybeat checks (25s-30s after the worker last checked in).

In addition, we need to update the current logic of the resource manager lock failover to match celerybeat's in order to ensure a 30s failover (see comment 5 for details).

The proposed timings are:

  • worker heartbeat 5s
  • celerybeat heartbeat 5s
  • worker ageout time 25s
  • celerybeat lock ageout time 25s
  • resource manager lock heartbeat 5s
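
As an illustration of why the proposed timings give a 25s-30s detection range, here is a hedged sketch of the "missing worker" check. The constant and function names are hypothetical and do not come from the Pulp codebase; the checker is assumed to tick once per heartbeat interval.

```python
from datetime import datetime, timedelta

# Hypothetical names; the values are the proposed timings from this story.
HEARTBEAT_INTERVAL = timedelta(seconds=5)
WORKER_TIMEOUT = timedelta(seconds=25)


def is_missing(last_heartbeat, now):
    """A worker is missing once it has been silent for the full timeout."""
    return now - last_heartbeat >= WORKER_TIMEOUT


def detection_delay(check_offset):
    """Time from a worker's last heartbeat until a checker that ticks every
    HEARTBEAT_INTERVAL first sees it as missing. `check_offset` is how long
    after that last heartbeat the checker's next tick happens."""
    last_heartbeat = datetime(2017, 1, 1)
    now = last_heartbeat + check_offset
    while not is_missing(last_heartbeat, now):
        now += HEARTBEAT_INTERVAL
    return now - last_heartbeat


best = detection_delay(timedelta(0))
near_worst = detection_delay(timedelta(milliseconds=4900))
print(best.total_seconds(), near_worst.total_seconds())
```

The delay depends only on how the checker's ticks line up with the worker's last heartbeat, so it always falls between the timeout and the timeout plus one tick.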
Actions #1

Updated by bizhang over 7 years ago

  • Subject changed from As a User I would like pulp celerybeat timeout time to be changed to 60s to Pulp workers should be considered dead if missing for 30 seconds
  • Description updated (diff)
  • Tags deleted (Easy Fix)
Actions #2

Updated by bmbouter over 7 years ago

  • Subject changed from Pulp workers should be considered dead if missing for 30 seconds to Pulp process failure detection and any failover should occur within 30 seconds
  • Description updated (diff)
Actions #3

Updated by bmbouter over 7 years ago

Two questions:
1) With the proposed timings, does pulp_celerybeat failover occur within 25 - 30 seconds, with 25 seconds coming from the celerybeat lock ageout time and up to 5 more seconds from the tick?

2) What are the upper and lower bounds of failover for pulp_resource_manager, and what timings create those bounds?

Actions #4

Updated by bizhang over 7 years ago

  • Description updated (diff)
  1. Yep, the second celerybeat process also ticks once every 5 seconds; if it sees that the first has not checked in for 25 seconds, it will fail over. The bound is 25-30s.
  2. The resource manager wakes up every resource manager heartbeat [0] and checks whether it can acquire the lock, so it would actually be sufficient to set this time to 30s to guarantee a 30s resource manager failover. The bound is 0-30s.

[0] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L144

Actions #5

Updated by bizhang over 7 years ago

  • Description updated (diff)

I stand corrected: the resource_manager timeout range is 30-60s, since it would wait a heartbeat after the resource_manager lock gets removed.

As a part of this story we should revamp get_resource_manager_lock to match what we have done with celerybeat [0]:
We need to add a timestamp to the ResourceManagerLock [1] and move the lock timeout check from [2] to [3]. The resource manager heartbeat should be set to 5s and the ageout time to 25s (for a range of 25s-30s).

[0] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/scheduler.py#L261
[1] https://github.com/pulp/pulp/blob/master/server/pulp/server/db/model/__init__.py#L1009
[2] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/tasks.py#L261
[3] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L103
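
A rough sketch of what a timestamp-based lock could look like is shown below. It is an in-memory stand-in under stated assumptions, not the actual get_resource_manager_lock or ResourceManagerLock code: the class, function, and constant names are hypothetical, and a real implementation would need the read-and-update to be atomic in the database.

```python
from datetime import datetime, timedelta

LOCK_TIMEOUT = timedelta(seconds=25)  # proposed ageout time


class InMemoryLock:
    """Hypothetical stand-in for a lock record with an owner and a timestamp."""

    def __init__(self):
        self.owner = None
        self.timestamp = None


def try_acquire(lock, name, now):
    """Called once per heartbeat (5s) by each resource manager.
    Returns True if `name` holds the lock after this call."""
    if lock.owner == name:
        lock.timestamp = now  # current holder refreshes its timestamp
        return True
    if lock.owner is None or now - lock.timestamp >= LOCK_TIMEOUT:
        # Lock is free or the holder has been silent for the full timeout.
        lock.owner = name
        lock.timestamp = now
        return True
    return False


lock = InMemoryLock()
t0 = datetime(2017, 1, 1)
first = try_acquire(lock, "rm1", t0)                             # rm1 takes the lock
early = try_acquire(lock, "rm2", t0 + timedelta(seconds=24))     # too early
takeover = try_acquire(lock, "rm2", t0 + timedelta(seconds=25))  # rm1 aged out
print(first, early, takeover, lock.owner)
```

Because the hot spare checks every 5s and takes over once the timestamp is 25s old, failover in this sketch lands in the 25s-30s range described above.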

Actions #6

Updated by bizhang over 7 years ago

  • Description updated (diff)
Actions #7

Updated by bmbouter over 7 years ago

  • Description updated (diff)

What do you think about consolidating these constants? So many of their values will be the same.

Actions #8

Updated by bizhang over 7 years ago

Yes! Ideally we should have one heartbeat constant and one timeout constant.

Actions #10

Updated by bmbouter over 7 years ago

  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes

This looks really great. I'm grooming it.

Actions #11

Updated by bmbouter over 7 years ago

  • Sprint/Milestone set to 31

Per IRC convo, adding to the current sprint.

Actions #12

Updated by dalley over 7 years ago

  • Assignee set to dalley
Actions #13

Updated by dalley over 7 years ago

  • Status changed from NEW to ASSIGNED
Actions #20

Updated by dalley over 7 years ago

  • Status changed from ASSIGNED to POST
Actions #21

Updated by bmbouter over 7 years ago

Added by dalley over 7 years ago

Revision f9355d21 | View on GitHub

Reduced heartbeat/timeout intervals

Simplified and reduced the timings so that all failure detection occurs within 30 seconds. Changed resource_manager to use a timestamp-based locking mechanism.

closes #2509 https://pulp.plan.io/issues/2509

Actions #23

Updated by dalley over 7 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #24

Updated by semyers over 7 years ago

  • Platform Release set to 2.12.0
Actions #25

Updated by semyers over 7 years ago

  • Status changed from MODIFIED to 5
Actions #26

Updated by semyers about 7 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #27

Updated by bmbouter about 6 years ago

  • Sprint set to Sprint 13
Actions #28

Updated by bmbouter about 6 years ago

  • Sprint/Milestone deleted (31)
Actions #29

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added
