Issue #1114: If an instance of pulp_resource_manager dies unexpectedly, Pulp incorrectly tries to "cancel all tasks in its queue"


1. Start a clustered Pulp installation with two machines, referred to here as boxA and boxB. Start all pulp_* and httpd services on boxA.
2. Start a second pulp_resource_manager instance on boxB.
3. Use the /status/ API to verify that both entries appear. They should show as 'resource_manager@boxA' and 'resource_manager@boxB'.
4. Run kill -9 against the pulp_resource_manager process on boxB.
5. Wait 6 or 7 minutes.
6. Observe log output similar to the following on boxA.

 <pre> 
 pulp.server.async.scheduler:ERROR: Workers 'resource_manager@boxB' has gone missing, removing from list of workers 
 pulp.server.async.tasks:ERROR: The worker named resource_manager@boxB is missing. Canceling the tasks in its queue. 
 </pre> 

Two things are wrong here, and both originate in "this section of code":https://github.com/pulp/pulp/blob/2e250c92d2bf58a1759cf7931af3976b7cef6e28/server/pulp/server/async/scheduler.py#L234-L236.

(1) It should never call _delete_worker(worker.name) for a resource manager. That function attempts to cancel tasks, log, and clean up reservations, none of which makes sense for pulp_resource_manager. Instead, it should delete the worker record synchronously and continue.

(2) The error message is misleading. I suggest it read something like:

 <pre> 
 resource_manager@boxB scheduler@boxB has gone missing. 
 </pre>
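The two changes proposed above could look roughly like the sketch below. It is not the actual scheduler.py code. handle_missing_worker, is_resource_manager, and the delete_record callable are hypothetical names introduced for illustration. delete_worker stands in for the existing _delete_worker. The name-prefix check is an assumption based on the 'resource_manager@boxB' entries shown in this report, and the log line is a simplified form of the suggested wording.

```python
import logging

_logger = logging.getLogger(__name__)

# Assumption: resource manager records are named 'resource_manager@<host>',
# as seen in the /status/ output described in this report.
RESOURCE_MANAGER_PREFIX = 'resource_manager@'


def is_resource_manager(worker_name):
    """Return True if the worker name looks like a resource manager (hypothetical helper)."""
    return worker_name.startswith(RESOURCE_MANAGER_PREFIX)


def handle_missing_worker(worker_name, delete_record, delete_worker):
    """
    Decide how to clean up after a worker that stopped sending heartbeats.

    delete_record: hypothetical callable that removes only the worker record.
    delete_worker: stand-in for _delete_worker, which also cancels tasks and
                   cleans up reservations.
    """
    if is_resource_manager(worker_name):
        # A resource manager has no task queue of its own to cancel, so only
        # report that it is gone and remove its record synchronously.
        _logger.error('%s has gone missing.', worker_name)
        delete_record(worker_name)
        return 'record_deleted'
    # Ordinary workers keep the existing full cleanup path.
    delete_worker(worker_name)
    return 'full_cleanup'
```

With this shape, a call such as handle_missing_worker('resource_manager@boxB', remove_record, _delete_worker) would remove only the record, while any other worker name would still go through the full cleanup. remove_record here is again a placeholder for whatever performs the direct deletion.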
