Multiple resource_managers on the same database
Let's consider the following situation:
One server is running celerybeat, workers, and a resource_manager.
The user wants to run a couple of additional workers on a second server, but accidentally starts a resource_manager on that server as well.
Now there are worker[0-x]_srv1 and worker[0-x]_srv2 records in the database. The user then kills the resource_manager and workers on srv2.
But the worker[0-x]_srv2 records remain in the database, and resource_manager_srv1 only takes care of worker[0-x]_srv1.
The workers from srv2 are already dead but still present in the database, so pulp will happily assign tasks to them.
The solution is to run the resource_manager on srv2 again and wait until it clears the dead workers from the database, or to remove them manually.
If the workers are removed from the database manually, the user also needs to stop all services and then start them again.
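To make the failure mode concrete, here is a minimal sketch (not Pulp's actual dispatch code; the worker names, the in-memory list standing in for db.workers, and the round-robin assignment are all illustrative assumptions) of why stale records cause lost tasks:

```python
from itertools import cycle

# Hypothetical stand-in for the db.workers collection. The srv2 worker
# processes were killed, but their records were never removed.
workers = [
    {"name": "worker0_srv1", "alive": True},
    {"name": "worker1_srv1", "alive": True},
    {"name": "worker0_srv2", "alive": False},  # process killed on srv2 ...
    {"name": "worker1_srv2", "alive": False},  # ... but record remains
]

def assign(tasks, worker_records):
    """Naively assign tasks round-robin over every record in the table,
    with no liveness check -- mirroring the problem described above."""
    assignments = {}
    for task, worker in zip(tasks, cycle(worker_records)):
        assignments[task] = worker["name"]
    return assignments

result = assign(["task1", "task2", "task3", "task4"], workers)
lost = [t for t, w in result.items() if w.endswith("_srv2")]
print(lost)  # -> ['task3', 'task4']: tasks handed to already-dead workers
```

Any dispatcher that reads the worker table without checking liveness will show this behavior, which is why the stale records must be cleared before normal operation resumes.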
Possible ways to prevent this:
- the resource_manager is also used for assigning tasks.
- a mechanism that prevents running two or more resource_managers against one database
- the resource_manager manages all workers in db.workers, not only the ones registered to it.
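The last option above could be sketched as a heartbeat sweep: the resource_manager drops every record whose heartbeat is stale, regardless of which manager the worker registered with. This is a sketch under assumed names only; the timeout value, the `last_heartbeat` field, and the function are hypothetical, not Pulp's real schema or settings:

```python
import time

HEARTBEAT_TIMEOUT = 30  # seconds; illustrative value, not Pulp's actual setting

def clear_dead_workers(worker_records, now=None):
    """Keep only records whose last heartbeat is within the timeout,
    independent of which resource_manager the worker registered with."""
    now = time.time() if now is None else now
    return [w for w in worker_records
            if now - w["last_heartbeat"] <= HEARTBEAT_TIMEOUT]

# Records from both servers; the srv2 worker stopped heartbeating long ago.
now = 1_000_000.0
records = [
    {"name": "worker0_srv1", "last_heartbeat": now - 5},
    {"name": "worker0_srv2", "last_heartbeat": now - 600},
]
alive = clear_dead_workers(records, now=now)
print([w["name"] for w in alive])  # -> ['worker0_srv1']
```

Because the sweep looks at all of db.workers rather than a per-manager registration list, a dead srv2 worker is cleaned up even when no resource_manager is running on srv2.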
#1 Updated by bmbouter almost 6 years ago
- Status changed from NEW to 7
In the upcoming 2.6.0 release the tasking system has received significant improvements. See #157 for more details on the expected behavior. It will dispatch tasks to, discover, and monitor workers across any number of machines.
Running two resource managers is not correct, but it should still provide mostly correct operation. The resource locking done by the resource_manager code was designed to be concurrent, so we expect it to perform adequately even if two are mistakenly started.
The important thing is that when the second resource manager stops, its records are correctly removed. I just did some testing with the upcoming 2.6.0 release and I correctly see:
pulp.server.async.scheduler:ERROR: Worker 'firstname.lastname@example.org' has gone missing, removing from list of workers
pulp.server.async.tasks:ERROR: The worker named email@example.com is missing. Canceling the tasks in its queue
That is expected, so I'm going to close this issue for now. Try the 2.6 beta and attempt to reproduce the problem. If you can reproduce it there, please reopen the issue.
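The cleanup behavior described in that log output can be sketched roughly as follows. This is a minimal illustration, not Pulp's actual implementation; the function, the log list, and the in-memory queue records are all hypothetical:

```python
def handle_missing_worker(worker_name, task_queue, log):
    """When a worker goes missing, log the event and cancel the tasks
    queued for it, returning the remaining and cancelled tasks."""
    log.append("The worker named %s is missing. "
               "Canceling the tasks in its queue" % worker_name)
    cancelled = [t for t in task_queue if t["worker"] == worker_name]
    remaining = [t for t in task_queue if t["worker"] != worker_name]
    return remaining, cancelled

log = []
queue = [
    {"id": 1, "worker": "worker0_srv2"},  # queued on the missing worker
    {"id": 2, "worker": "worker0_srv1"},
]
remaining, cancelled = handle_missing_worker("worker0_srv2", queue, log)
print(len(remaining), len(cancelled))  # -> 1 1
```

The key point is that the missing worker's queued tasks are cancelled rather than left to sit forever, so no tasks are stranded on records for dead workers.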