Story #2509

Updated by bizhang over 7 years ago

h2. Problem 

 Pulp's failure detection and failover currently take a long time. There are two primary areas: (1) marking workers as dead and (2) celerybeat and resource_manager hot-spare failover. 

 The current timings are: 
 * worker heartbeat is 30 seconds  
 * celerybeat heartbeat is 90 seconds 
 * worker ageout time is 300 seconds 
 For celerybeat and resource_manager hot-spare failover, the timings are: 
 * celerybeat lock ageout time is 200 seconds 
 * resource manager lock check is 60 seconds 

 This means that when a worker process dies, it can take 300-390 seconds for it to be considered dead. It also means that celerybeat takes 200-290 seconds to fail over, and the resource_manager takes 60-120 seconds to fail over. 
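 The worker and celerybeat ranges above can be reproduced with a small calculation. This is only a sketch: it assumes a stale record is noticed at the first celerybeat check after its ageout expires, so the delay lies between the ageout and the ageout plus one 90-second celerybeat heartbeat. The helper name @detection_window@ is hypothetical and is not part of Pulp.

```python
def detection_window(ageout_s, check_interval_s):
    """Return (best_case, worst_case) seconds until a stale record is noticed.

    Assumption: the record is only examined once per check interval, so the
    worst case adds one full interval on top of the ageout time.
    """
    return ageout_s, ageout_s + check_interval_s


# Current timings from this story; celerybeat checks every 90 seconds.
worker_window = detection_window(ageout_s=300, check_interval_s=90)
celerybeat_window = detection_window(ageout_s=200, check_interval_s=90)

print(worker_window)      # (300, 390)
print(celerybeat_window)  # (200, 290)
```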

h2. Solution 

 This time should be shorter. We should declare Pulp workers dead if they have been missing for 30 seconds. Both pulp_celerybeat and pulp_resource_manager should also fail over within 30 seconds. 

 We can do this by updating the worker heartbeat to 5 seconds, the celerybeat heartbeat to 5 seconds, and the worker ageout time to 25 seconds. 

 This would mean that a worker that has not checked in for 5 heartbeats (25s) would be considered missing the next time celerybeat checks (25-30s after the worker last checked in). 
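 The 25-30s window can be illustrated with a small simulation. This is a sketch under stated assumptions, not Pulp's actual code: celerybeat is assumed to check on a fixed 5-second schedule, and a worker is treated as missing once its last check-in is at least 25 seconds old. The function name and the use of @>=@ rather than @>@ are assumptions made for illustration.

```python
def first_detection_delay(check_offset_s, check_interval_s=5, ageout_s=25):
    """Seconds after a worker's last check-in until a celerybeat check flags it.

    check_offset_s is when the first celerybeat check happens, measured from
    the worker's last check-in (assumed to be between 0 and one interval).
    """
    t = check_offset_s
    # Step through celerybeat checks until the check-in is old enough.
    while t < ageout_s:
        t += check_interval_s
    return t


# Try every whole-second phase between the worker and celerybeat schedules.
delays = [first_detection_delay(offset) for offset in range(5)]
print(delays)  # [25, 26, 27, 28, 29]
```

 Whatever the phase between the two schedules, the delay stays at or above the 25s ageout and below 30s, which is the target stated above.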

 In addition, we need to update the resource manager lock failover logic to match celerybeat's in order to ensure a 30s failover (see comment #5 for details). 

 The proposed timings are: 
 * worker heartbeat 5s 
 * celerybeat heartbeat 5s 
 * worker ageout time 25s 
 * celerybeat lock ageout time 25s 
 * resource manager lock heartbeat 5s 30s
