Issue #1114

Updated by bmbouter over 8 years ago

1. Start a clustered pulp installation with two machines in the cluster. Suppose their hostnames are boxA and boxB. Start all pulp_* and httpd services on boxA.
 2. Start a second pulp_resource_manager instance on boxB. 
 3. Use the /status/ API to verify that you can see both entries. They should show as 'resource_manager@boxA' and 'resource_manager@boxB'. 
 4. kill -9 the pulp_resource_manager service on boxB. 
 5. Wait for 6 or 7 minutes.
 6. Observe a traceback similar to the following in the logs of boxA. 

 pulp.server.async.scheduler:ERROR: Workers 'resource_manager@boxB' has gone missing, removing from list of workers 
 pulp.server.async.tasks:ERROR: The worker named resource_manager@boxB is missing. Canceling the tasks in its queue. 

 Two things are wrong with this, and both of them are located in "this section of code": 

 (1) It should never call _delete_worker(), which attempts to cancel tasks, log, and clean up reservations; none of these make sense for pulp_resource_manager. Instead, it should delete the worker record synchronously and continue.

 (2) The error message is misleading. I suggest it read something like:

 resource_manager@boxB scheduler@boxB has gone missing.
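
 A minimal sketch of the branching described in point (1), not the actual Pulp code. The function name handle_missing_worker, the worker_records dict (a stand-in for the worker collection in the database), and the full_cleanup callable (a stand-in for _delete_worker) are all hypothetical; the "resource_manager@" prefix is assumed from the log lines above.

```python
# Assumed naming convention, taken from the worker names in the log lines.
RESOURCE_MANAGER_PREFIX = "resource_manager@"


def handle_missing_worker(name, worker_records, full_cleanup):
    """Remove a worker that has stopped heartbeating.

    worker_records: dict of worker name -> record (hypothetical stand-in
        for the persisted worker collection).
    full_cleanup: callable that cancels tasks and cleans up reservations
        (hypothetical stand-in for _delete_worker).
    Returns which path was taken, to make the behavior easy to check.
    """
    if name.startswith(RESOURCE_MANAGER_PREFIX):
        # A resource manager has no task queue or reservations of its own,
        # so only drop its record, synchronously, and move on.
        worker_records.pop(name, None)
        return "record_only"

    # Any other worker still gets the full cleanup path.
    full_cleanup(name)
    return "full_cleanup"


if __name__ == "__main__":
    records = {"resource_manager@boxB": {}}
    cleaned = []
    print(handle_missing_worker("resource_manager@boxB", records, cleaned.append))
    print(records, cleaned)
```

 Passing the cleanup function in as an argument keeps the sketch self-contained; in the real scheduler the two paths would presumably call the existing database and task helpers directly.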