Story #2509
closed
Pulp process failure detection and any failover should occur within 30 seconds
100%
Description
Problem
Pulp's failure detection and failover currently take a long time. There are two primary areas: (1) marking workers as dead and (2) pulp_celerybeat and resource_manager hot-spare failover.
The current timings are:
- worker heartbeat is 30 seconds
- celerybeat heartbeat is 90 seconds
- worker ageout time is 300 seconds
- celerybeat lock ageout time is 200 seconds
- resource manager lock check is 60 seconds
This means that when a worker process dies, it can take 300-390s for it to be considered dead. It also means that celerybeat takes 200-290s to fail over and the resource_manager takes 300-360s to fail over.
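The ranges above can be reproduced with a little arithmetic. This is only a sketch: the helper `detection_window` and the constant names are made up for illustration, and pairing the resource manager's 60s lock check with the 300s worker ageout is an assumption inferred from the stated 300-360s range.

```python
def detection_window(ageout_s, check_interval_s):
    """Return (best case, worst case) seconds until a dead process is noticed.

    A process is noticed once it has been silent for `ageout_s`, but only
    at the next periodic check, which can be up to `check_interval_s` later.
    """
    return ageout_s, ageout_s + check_interval_s


# Current timings from the list above, in seconds (names are hypothetical).
CELERYBEAT_HEARTBEAT_S = 90
WORKER_AGEOUT_S = 300
CELERYBEAT_LOCK_AGEOUT_S = 200
RESOURCE_MANAGER_LOCK_CHECK_S = 60

worker_window = detection_window(WORKER_AGEOUT_S, CELERYBEAT_HEARTBEAT_S)
celerybeat_window = detection_window(CELERYBEAT_LOCK_AGEOUT_S, CELERYBEAT_HEARTBEAT_S)
# Assumption: the resource manager lock is released once its worker ages out.
resource_manager_window = detection_window(WORKER_AGEOUT_S, RESOURCE_MANAGER_LOCK_CHECK_S)

print(worker_window)            # (300, 390)
print(celerybeat_window)        # (200, 290)
print(resource_manager_window)  # (300, 360)
```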
Solution
This time should be shorter. We should declare Pulp workers to be dead if they have been missing for 30 seconds. We should also have both pulp_celerybeat and pulp_resource_manager fail over within 30 seconds.
We can do this by updating the worker heartbeat to 5 seconds, the celerybeat heartbeat to 5 seconds and the worker ageout time to 25 seconds.
This would mean that a worker that has not checked in for 5 heartbeats (25s) would be considered missing the next time celerybeat checks (25s-30s after the worker last checked in).
In addition, we need to update the resource manager lock failover logic to match celerybeat's in order to ensure a 30s failover (see comment 5 for details).
The proposed timings are:
- worker heartbeat 5s
- celerybeat heartbeat 5s
- worker ageout time 25s
- celerybeat lock ageout time 25s
- resource manager lock heartbeat 5s
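As a rough illustration of the worker check with the proposed 25s ageout, the sketch below flags workers whose last heartbeat is at least 25 seconds old. The function `missing_workers`, the in-memory dict, and the `>=` comparison are assumptions for illustration only, not Pulp's actual code.

```python
WORKER_AGEOUT_S = 25  # proposed worker ageout time


def missing_workers(last_heartbeat, now):
    """Return the names of workers that have been silent for the ageout time.

    `last_heartbeat` maps a worker name to the time (in seconds) of its
    most recent heartbeat; `now` is the time of the celerybeat check.
    """
    return sorted(
        name
        for name, seen in last_heartbeat.items()
        if now - seen >= WORKER_AGEOUT_S
    )


# Hypothetical heartbeat records, checked at t=100s.
heartbeats = {"worker-1": 98, "worker-2": 75, "worker-3": 80}
print(missing_workers(heartbeats, now=100))  # ['worker-2']
```

Because the check itself only runs once per 5s celerybeat tick, a worker that stops at an arbitrary moment is flagged 25-30s after its last heartbeat.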
Updated by bizhang almost 8 years ago
- Subject changed from As a User I would like pulp celerybeat timeout time to be changed to 60s to Pulp workers should be considered dead if missing for 30 seconds
- Description updated (diff)
- Tags deleted (Easy Fix)
Updated by bmbouter almost 8 years ago
- Subject changed from Pulp workers should be considered dead if missing for 30 seconds to Pulp process failure detection and any failover should occur within 30 seconds
- Description updated (diff)
Updated by bmbouter almost 8 years ago
Two questions:
1) With the proposed timings, does pulp_celerybeat failover occur within 25-30 seconds: 25 seconds from the celerybeat lock ageout time, plus up to 5 seconds until the next tick?
2) What are the upper and lower bounds of failover for pulp_resource_manager, and what timings create those bounds?
Updated by bizhang almost 8 years ago
- Description updated (diff)
1) Yep, the second celerybeat process is also ticking once every 5 seconds; if it sees that the first has not checked in for 25 seconds, it will fail over. The bound is 25-30s.
2) The resource manager wakes up every resource manager heartbeat [0] and checks whether it can acquire the lock. So it would actually be sufficient to set this time to 30s to guarantee a 30s resource manager failover. The bound is 0-30s.
[0] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L144
Updated by bizhang almost 8 years ago
- Description updated (diff)
I stand corrected, the resource_manager timeout range is 30-60s since it would wait a heartbeat after the resource_manager lock gets removed.
As a part of this story we should revamp the get_resource_manager_lock to match what we have done with celerybeat [0]:
We need to add a timestamp to the ResourceManagerLock [1], and move the lock timeout check from [2] to [3]. The resource manager heartbeat should be set to 5s and the ageout time should be 25s (for a range of 25s-30s).
[0] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/scheduler.py#L261
[1] https://github.com/pulp/pulp/blob/master/server/pulp/server/db/model/__init__.py#L1009
[2] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/tasks.py#L261
[3] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L103
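To make the timestamp-based approach concrete, here is a minimal in-memory sketch. It is not the Pulp implementation: the `Lock` dataclass, `try_acquire`, and `failover_delay` are hypothetical names, and the real lock lives in the database rather than in a Python object. The holder refreshes the timestamp on every heartbeat, and a hot spare takes over once the timestamp is at least 25s old.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_S = 5   # how often each process wakes up
AGEOUT_S = 25     # lock is considered stale after this much silence


@dataclass
class Lock:
    owner: str
    timestamp: float  # time of the owner's last heartbeat


def try_acquire(lock: Optional[Lock], name: str, now: float) -> Lock:
    """Return the lock state after `name` tries to take or renew it at `now`."""
    if lock is None or lock.owner == name or now - lock.timestamp >= AGEOUT_S:
        return Lock(owner=name, timestamp=now)
    return lock  # still held by a live owner


def failover_delay(spare_phase: float) -> float:
    """Seconds after the primary's last heartbeat until the spare takes over.

    The primary last refreshed the lock at t=0 and then died. The spare
    wakes at `spare_phase`, then every HEARTBEAT_S seconds after that.
    """
    lock = Lock(owner="primary", timestamp=0.0)
    now = spare_phase
    while True:
        lock = try_acquire(lock, "spare", now)
        if lock.owner == "spare":
            return now
        now += HEARTBEAT_S


print([failover_delay(phase) for phase in range(HEARTBEAT_S)])
# [25, 26, 27, 28, 29]
```

Whatever the spare's wake-up phase, the takeover lands in the 25s-30s range, which is the point of moving the timeout check into the heartbeat itself.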
Updated by bmbouter almost 8 years ago
- Description updated (diff)
What do you think about consolidating these constants? So many of their values will be the same.
Updated by bizhang almost 8 years ago
yes! Ideally we should have one heartbeat constant and one timeout time constant.
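A consolidated version could look roughly like the following. The constant and function names are placeholders, not names from the Pulp codebase; the values are the proposed 5s heartbeat and 25s timeout.

```python
# Hypothetical shared constants for workers, celerybeat and resource manager.
HEARTBEAT_INTERVAL_S = 5   # every process checks in this often
HEARTBEAT_TIMEOUT_S = 25   # silence longer than this means "dead"
FAILOVER_TARGET_S = 30     # the goal stated in this story


def worst_case_detection_s():
    """Timeout plus one full interval until the next check notices it."""
    return HEARTBEAT_TIMEOUT_S + HEARTBEAT_INTERVAL_S


# Guard against someone changing one constant and breaking the 30s target.
assert worst_case_detection_s() <= FAILOVER_TARGET_S
```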
Updated by bmbouter almost 8 years ago
- Groomed changed from No to Yes
- Sprint Candidate changed from No to Yes
This looks really great. I'm grooming it.
Updated by bmbouter almost 8 years ago
- Sprint/Milestone set to 31
Per IRC convo, adding to the current sprint.
Updated by dalley almost 8 years ago
- Status changed from ASSIGNED to POST
Updated by bmbouter almost 8 years ago
The test plan has moved here: https://github.com/PulpQE/pulp-smash/issues/474
Added by dalley almost 8 years ago
Revision f9355d21 | View on GitHub
Reduced heartbeat/timeout intervals
Simplified and reduced the timings so that all failure detection occurs within 30 seconds. Changed resource_manager to use a timestamp-based locking mechanism.
Updated by dalley almost 8 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|f9355d21559fa880fd7654134d8f34dcf6b85acf.
Updated by semyers almost 8 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
closes #2509 https://pulp.plan.io/issues/2509