Issue #3281
Several active resource managers in "celery status"
Status: closed
Description
We have two active Pulp clusters. Both clusters are distributed over two datacentres and are fully virtualized. However, one of them behaves abnormally: out of the blue, we see two or three active resource managers in "celery status". Other than that, the cluster seems to behave normally so far.
We think we have found the cause: a defect in a hypervisor apparently makes machines in the broken cluster rapidly pause and unpause. This seems to directly cause the Pulp cluster to miss heartbeats, but it is not enough to partition the RabbitMQ cluster or MongoDB. All expected workers are present as well.
Steps to replicate (at least partially):
1. Set up a three-node cluster. The RabbitMQ master and the active resource manager need to be on different nodes.
2. Pause the VM running the active resource manager, wait a few minutes, and unpause it.
You will now see two active resource managers, but no workers. A scripted version of the pause/unpause step is sketched below.
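A minimal sketch of step 2 as a script, under some assumptions: it uses libvirt's virsh CLI, the domain name "pulp-rm-node" is a hypothetical stand-in for the VM running the active resource manager, and the final "celery status" call may need your app/broker options.

```python
#!/usr/bin/env python
# Hypothetical reproduction helper. Assumes libvirt's virsh CLI is available
# and that the active resource manager runs in the domain named below.
import subprocess
import time

DOMAIN = "pulp-rm-node"  # hypothetical name of the VM running the active resource manager

subprocess.check_call(["virsh", "suspend", DOMAIN])  # pause the VM
time.sleep(300)                                      # wait a few minutes so heartbeats are missed
subprocess.check_call(["virsh", "resume", DOMAIN])   # unpause it

# Inspect the cluster afterwards; two active resource managers should now be listed.
# (You may need to pass your usual app/broker options to celery here.)
subprocess.check_call(["celery", "status"])
```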
Proposed fix: have each active resource manager periodically check whether other active resource managers exist, and have all but one go back to hot standby. A sketch of such a check follows.
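A minimal sketch of that check, not Pulp's actual implementation: it assumes a Celery app object, that resource manager nodes are named "resource_manager@<host>", and a hypothetical demote_to_standby() hook; ties are broken deterministically by letting the lexicographically smallest node name stay active.

```python
import time

def enforce_single_resource_manager(app, nodename, demote_to_standby, interval=30):
    """Periodically verify this node is the only active resource manager.

    app: a celery.Celery instance; nodename: this node's full name, e.g.
    "resource_manager@host1"; demote_to_standby: hypothetical hook that
    returns this node to hot standby.
    """
    while True:
        # Ping all reachable workers; returns {nodename: reply} or None.
        replies = app.control.inspect(timeout=5).ping() or {}
        managers = sorted(n for n in replies if n.startswith("resource_manager@"))
        # If more than one resource manager is active, every node except
        # the deterministic winner (smallest node name) steps down.
        if len(managers) > 1 and nodename != managers[0]:
            demote_to_standby()
            return
        time.sleep(interval)
```

Using the smallest node name as the tiebreaker keeps the election stateless: every surviving manager reaches the same conclusion from the same ping replies, so no extra coordination channel is needed.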