multiple resource_managers on the same database
Let's consider the following situation:
One server is running celerybeat, workers, and a resource_manager.
The user wants to run a few more workers on another server, but accidentally starts a resource_manager on that server as well.
Now there are worker[0-x]_srv1 and worker[0-x]_srv2 records in the database. The user then kills the resource_manager and workers on srv2.
But the worker[0-x]_srv2 records remain in the database, and resource_manager_srv1 takes care only of worker[0-x]_srv1.
The workers from srv2 are already dead but still present in the database, so pulp will happily assign tasks to them.
The fix is to run the resource_manager on srv2 again and wait until it clears the dead workers from the database, or to remove them manually.
If the workers are removed from the database manually, the user also needs to stop all services and then start them again.
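To make the failure mode concrete, here is a minimal, hypothetical sketch of the cleanup job a resource_manager performs: any worker whose last heartbeat is older than a timeout is treated as dead and dropped, so tasks stop being routed to it. The function name, the timeout value, and the in-memory dict standing in for db.workers are all illustrative assumptions, not Pulp's actual code or settings.

```python
import time

# Illustrative value only; not Pulp's actual heartbeat timeout.
WORKER_TIMEOUT = 300  # seconds

def remove_stale_workers(workers, now=None):
    """Drop workers whose heartbeat is older than WORKER_TIMEOUT.

    workers: dict mapping worker name -> last heartbeat timestamp.
    Returns the list of worker names that were removed.
    """
    now = now if now is not None else time.time()
    stale = [name for name, last_beat in workers.items()
             if now - last_beat > WORKER_TIMEOUT]
    for name in stale:
        # In Pulp this would be a delete against the workers collection.
        del workers[name]
    return stale
```

The bug in the scenario above is that each resource_manager only ran this kind of cleanup for the workers registered to it, so srv2's dead workers were nobody's responsibility once its resource_manager was killed.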
Possible ways to prevent this:
- the resource_manager could also be used for assigning tasks.
- a mechanism that prevents running two or more resource_managers against one database.
- the resource_manager could manage all workers in db.workers, not only the ones registered to it.
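The second idea above can be sketched as a singleton lock: before acting as the resource_manager, a process tries to insert a well-known lock record, and a unique-key constraint ensures only one insert succeeds. Everything here (the class names, the `resource_manager` lock id) is a hypothetical illustration with an in-memory stand-in for a database collection, not Pulp's schema.

```python
class DuplicateKeyError(Exception):
    """Raised when an insert violates the unique key."""

class LockCollection:
    """Stands in for a DB collection with a unique index on '_id'."""
    def __init__(self):
        self._docs = {}

    def insert_unique(self, doc):
        if doc['_id'] in self._docs:
            raise DuplicateKeyError(doc['_id'])
        self._docs[doc['_id']] = doc

def acquire_resource_manager_lock(locks, hostname):
    """Return True if this process may act as the resource_manager."""
    try:
        locks.insert_unique({'_id': 'resource_manager', 'holder': hostname})
        return True
    except DuplicateKeyError:
        return False
```

With this in place, the accidental second resource_manager on srv2 would fail to acquire the lock and could refuse to start instead of registering a second set of workers.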
#1 Updated by bmbouter over 5 years ago
- Status changed from NEW to 7
In the upcoming 2.6.0 release the tasking system received significant improvements. See #157 for more details on the expected behavior. It dispatches tasks to, discovers, and monitors workers across any number of machines.
Running two resource managers is not correct, but it should still provide mostly correct operation. The resource locking done by the resource_manager code was designed to be concurrent, so we expect it to perform adequately even if two are mistakenly started.
The important thing is that when the second resource manager stops, its records are correctly removed. I just did some testing with the upcoming 2.6.0 release and I correctly see:
pulp.server.async.scheduler:ERROR: Workers 'email@example.com' has gone missing, removing from list of workers
pulp.server.async.tasks:ERROR: The worker named firstname.lastname@example.org is missing. Canceling the tasks in its queue
That is expected, so I'm going to close this issue for now. Try using the 2.6 beta and reproducing the problem. If you can reproduce it with that, then reopen the issue.