Issue #3281 (closed): Several active resource managers in "celery status"

Added by dustball about 6 years ago. Updated almost 5 years ago.

Status: CLOSED - WONTFIX
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

We have two active Pulp clusters. Both clusters are distributed over two datacentres and are fully virtualized. However, one of them behaves abnormally: out of the blue, we see two or three active resource managers in "celery status". Other than that, the cluster seems to behave normally so far.

We think we have found the cause: a defect in a hypervisor apparently makes machines in the broken cluster rapidly pause and unpause. This seems to make the Pulp cluster miss heartbeats, but it is not enough to cause a partition in RabbitMQ or a split in MongoDB. All expected workers are present as well.

Steps to replicate (at least partially): Set up a three-node cluster. The RabbitMQ master and the active resource manager need to be on different nodes. Pause the VM running the active resource manager, wait a few minutes, and unpause it. You will now see two active resource managers, but no workers.
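
For reference, the pause/unpause step can be scripted with the libvirt Python bindings. This is only a rough sketch; the connection URI and the guest name "pulp-rm-node" are placeholders for whatever hypervisor and VM actually host the active resource manager.

    import time
    import libvirt

    # Connect to the hypervisor (URI is an assumption; adjust for your setup).
    conn = libvirt.open('qemu:///system')

    # "pulp-rm-node" is a placeholder for the VM running the active resource manager.
    dom = conn.lookupByName('pulp-rm-node')

    dom.suspend()      # pause the guest so the worker misses heartbeats
    time.sleep(300)    # wait a few minutes, as described above
    dom.resume()       # unpause; a second "active" resource manager may now appear

    conn.close()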

Proposed fix: Have active resource managers check regularly whether there are other active resource managers, and have all but one go back to hot standby.
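
For illustration, here is a minimal Python sketch of the detection half of that idea, assuming access to the same broker Pulp uses (the broker URL below is a placeholder) and relying on Pulp 2's convention of naming the resource manager worker "resource_manager@<hostname>". The election/step-down logic itself is not shown.

    from celery import Celery

    # Broker URL is an assumption; point it at the broker Pulp actually uses.
    app = Celery(broker='amqp://guest@localhost//')

    def active_resource_managers():
        """Return the hostnames of resource manager workers that answer a ping."""
        replies = app.control.inspect().ping() or {}
        return sorted(name for name in replies
                      if name.startswith('resource_manager@'))

    managers = active_resource_managers()
    if len(managers) > 1:
        # More than one active resource manager: per the proposed fix, all but
        # one would step down to hot standby at this point.
        print('Duplicate active resource managers detected:', managers)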

#1 Updated by dalley about 6 years ago

  • Triaged changed from No to Yes

#2 Updated by bmbouter almost 5 years ago

  • Status changed from NEW to CLOSED - WONTFIX

#3 Updated by bmbouter almost 5 years ago

  • Tags Pulp 2 added
