Issue #3281
Several active resource managers in "celery status"
Status: closed
Description
We have two active Pulp clusters. Both clusters are distributed over two datacentres and are fully virtualized. However, one of them behaves abnormally: out of the blue, we see two or three active resource managers in "celery status". Other than that, the cluster seems to behave normally so far.
We think we have found the cause: a defect in a hypervisor apparently makes machines in the broken cluster rapidly pause and unpause. This seems to directly cause the Pulp cluster to miss heartbeats, but it is not enough to partition the RabbitMQ cluster or MongoDB. All expected workers are present as well.
Steps to replicate (at least partially):
1. Set up a three-node cluster. The RabbitMQ master and the active resource manager need to be on different nodes.
2. Pause the VM running the active resource manager, wait a few minutes, and unpause it.
You will now see two active resource managers, but no workers. A scripted version of the pause/unpause step is sketched below.
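A minimal sketch of step 2 as a script, under some assumptions: it uses libvirt's virsh CLI, the domain name "pulp-rm-node" is a hypothetical stand-in for the VM running the active resource manager, and the final "celery status" call may need your app/broker options.

```python
#!/usr/bin/env python
# Hypothetical reproduction helper. Assumes libvirt's virsh CLI is available
# and that the active resource manager runs in the domain named below.
import subprocess
import time

DOMAIN = "pulp-rm-node"  # hypothetical name of the VM running the active resource manager

subprocess.check_call(["virsh", "suspend", DOMAIN])  # pause the VM
time.sleep(300)                                      # wait a few minutes so heartbeats are missed
subprocess.check_call(["virsh", "resume", DOMAIN])   # unpause it

# Inspect the cluster afterwards; two active resource managers should now be listed.
# (You may need to pass your usual app/broker options to celery here.)
subprocess.check_call(["celery", "status"])
```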
Proposed fix: have each active resource manager periodically check whether other active resource managers exist, and have all but one go back to hot standby. A sketch of such a check follows.
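A minimal sketch of that check, not Pulp's actual implementation: it assumes a Celery app object, that resource manager nodes are named "resource_manager@<host>", and a hypothetical demote_to_standby() hook; ties are broken deterministically by letting the lexicographically smallest node name stay active.

```python
import time

def enforce_single_resource_manager(app, nodename, demote_to_standby, interval=30):
    """Periodically verify this node is the only active resource manager.

    app: a celery.Celery instance; nodename: this node's full name, e.g.
    "resource_manager@host1"; demote_to_standby: hypothetical hook that
    returns this node to hot standby.
    """
    while True:
        # Ping all reachable workers; returns {nodename: reply} or None.
        replies = app.control.inspect(timeout=5).ping() or {}
        managers = sorted(n for n in replies if n.startswith("resource_manager@"))
        # If more than one resource manager is active, every node except
        # the deterministic winner (smallest node name) steps down.
        if len(managers) > 1 and nodename != managers[0]:
            demote_to_standby()
            return
        time.sleep(interval)
```

Using the smallest node name as the tiebreaker keeps the election stateless: every surviving manager reaches the same conclusion from the same ping replies, so no extra coordination channel is needed.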