Issue #470

Scheduler not able to recover from very long mongo and/or qpidd outage

Added by bmbouter almost 10 years ago. Updated over 5 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 3. High
Version: 2.4 Beta
Platform Release: 2.6.2
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

Celerybeat uses pulp.server.async.scheduler, which provides correct reconnect support if either Mongo or Qpid goes down and later comes back. For typical outages lasting minutes or hours, the reconnect support works fine. For outages on the order of days, the user will eventually see the following message:

pulp.server.async.scheduler:ERROR: [Errno 24] Too many open files

Once the user sees that message, reconnect support no longer works, and the celerybeat service must be restarted. Something in the reconnect support appears to consume a file descriptor with each reconnect attempt.

I'm not sure if it is Qpid or Mongo that causes this, so I'm identifying them both as possible causes.
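
For illustration, here is a minimal sketch of the kind of retry loop that produces this symptom; this is not Pulp's actual reconnect code, just a hypothetical pattern that matches the observed behavior. Each attempt allocates a new socket, and the failed socket is never closed, so one file descriptor leaks per retry until the per-process limit is hit:

    import socket
    import time

    def leaky_reconnect(host, port):
        """Keep retrying until the broker accepts a connection."""
        while True:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                sock.connect((host, port))  # fails while the broker is down
                return sock
            except OSError:
                # BUG: the failed socket is never closed, so each retry
                # leaks one file descriptor. Once the per-process limit
                # (commonly 1024) is exhausted, socket() itself raises
                # "[Errno 24] Too many open files".
                time.sleep(30)

If the leak is of this form, the fix would be to close the failed socket in the error path before sleeping and retrying.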

To reproduce:
1. Stop all Pulp services
2. Start Mongo
3. Start Qpid
4. Start celerybeat
5. Stop Mongo
6. Stop Qpid
7. Observe the reconnects trying over and over
8. Wait a long time (like overnight); the monitoring sketch after this list can confirm the leak while you wait
9. Observe the error message above in the logs
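
While waiting during step 8, the leak can be confirmed without reading the logs by polling the scheduler's open descriptor count. A minimal sketch, assuming a Linux host and permission to read /proc for the celerybeat process (the PID argument is supplied by you):

    import os
    import sys
    import time

    def fd_count(pid):
        # Each entry in /proc/<pid>/fd is one open file descriptor.
        return len(os.listdir('/proc/%d/fd' % pid))

    if __name__ == '__main__':
        pid = int(sys.argv[1])  # PID of the celerybeat process
        while True:
            print(time.strftime('%H:%M:%S'), fd_count(pid))
            time.sleep(60)  # a steadily climbing count confirms the leak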

This bug was cloned from Bugzilla Bug #1118404.
