Issue #1114
Updated by bmbouter over 9 years ago
1. Start a clustered Pulp installation with two machines in the cluster. Suppose the hosts are named boxA and boxB. Start all pulp_* and httpd services on boxA.
2. Start a second pulp_resource_manager instance on boxB.
3. Use the /status/ API to verify that both entries are visible. They should show as 'resource_manager@boxA' and 'resource_manager@boxB'.
4. kill -9 the pulp_resource_manager service on boxB.
5. Wait 6 or 7 minutes.
6. Observe a traceback similar to the following in the logs on boxA.

<pre>
pulp.server.async.scheduler:ERROR: Workers 'resource_manager@boxB' has gone missing, removing from list of workers
pulp.server.async.tasks:ERROR: The worker named resource_manager@boxB is missing. Canceling the tasks in its queue.
</pre>

Two things are wrong with this, and both are located in "this section of code":https://github.com/pulp/pulp/blob/2e250c92d2bf58a1759cf7931af3976b7cef6e28/server/pulp/server/async/scheduler.py#L234-L236.

(1) It should never call _delete_worker(worker.name), which attempts to cancel tasks, log, and clean up reservations, none of which makes sense for pulp_resource_manager. Instead it should delete the worker record synchronously and continue.

(2) The error message is misleading. I suggest it read something like:

<pre>
resource_manager@boxB scheduler@boxB has gone missing.
</pre>
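
To make point (1) concrete, here is a rough sketch of the branching I have in mind. It is not the actual scheduler code: @handle_missing_worker@, the @worker_records@ dict, and the @delete_worker@ callable are hypothetical stand-ins for the scheduler's loop body, the stored worker records, and _delete_worker respectively. The only details taken from this report are the 'resource_manager@' name prefix and the shape of the log message.

```python
import logging

_logger = logging.getLogger(__name__)

# Assumption: resource manager names start with this prefix, as in the
# 'resource_manager@boxB' entry shown in the report above.
RESOURCE_MANAGER_PREFIX = "resource_manager@"


def handle_missing_worker(name, worker_records, delete_worker):
    """Handle one worker whose heartbeat has timed out (illustrative only).

    worker_records -- hypothetical dict standing in for the stored worker records
    delete_worker  -- callable standing in for the full cleanup (_delete_worker)
    """
    if name.startswith(RESOURCE_MANAGER_PREFIX):
        # A resource manager owns no task queue or reservations to clean up,
        # so only drop its record synchronously and log a plain message.
        worker_records.pop(name, None)
        _logger.error("%s has gone missing.", name)
        return "record_removed"

    # Any other worker still goes through the full cleanup path.
    delete_worker(name)
    return "full_cleanup"
```

Calling it with 'resource_manager@boxB' removes only that record and never invokes the cleanup callable; any other worker name is passed to the callable unchanged.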