Issue #470

closed

Scheduler not able to recover from very long mongo and/or qpidd outage

Added by bmbouter about 9 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
3. High
Version:
2.4 Beta
Platform Release:
2.6.2
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

Celerybeat uses pulp.server.async.scheduler, which provides correct reconnect support if either Mongo or Qpid goes down and then comes back later. For normal outage times, on the order of minutes or hours, the reconnect support works fine. For outages that last on the order of days, the user will eventually receive the following message:

pulp.server.async.scheduler:ERROR: [Errno 24] Too many open files

Once the user sees that message, reconnect support no longer works and the celerybeat service needs to be restarted. The reconnect support appears to consume a file descriptor with each reconnect attempt.

I'm not sure if it is Qpid or Mongo that causes this, so I'm identifying them both as possible causes.
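
For anyone unfamiliar with the error, here is a minimal, purely illustrative Python sketch (not Pulp's reconnect code) of how leaking one pipe per reconnect attempt eventually produces exactly this error:

import os

# Hypothetical illustration only: each simulated reconnect attempt opens a
# pipe and never closes it. Eventually the process exhausts its per-process
# file descriptor limit and fails with "[Errno 24] Too many open files".
leaked = []
try:
    while True:
        leaked.append(os.pipe())  # two fds per call, never closed
except OSError as e:
    # typically errno.EMFILE, i.e. "[Errno 24] Too many open files"
    print("failed after leaking %d pipes: %s" % (len(leaked), e))
finally:
    for r, w in leaked:  # clean up so the demo process stays usable
        os.close(r)
        os.close(w)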

To reproduce:
1. Stop all Pulp services
2. Start Mongo
3. Start Qpid
4. Start celerybeat
5. stop Mongo
6. Stop Qpid
7. Observe the reconnects trying over and over
8. Wait a long time (like overnight)
9. Observe the error message above in the logs

+ This bug was cloned from Bugzilla Bug #1118404 +

Actions #1

Updated by bmbouter about 9 years ago

The failing component is likely long-term reconnect with qpidd because there is another error with pulp_celerybeat that occurs if mongod is not available [0]. That other error is experienced on a much shorter timeline (minutes) with a different traceback. By deduction I believe this BZ is likely caused by reconnect support with respect to qpidd.

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1129488

+ This comment was cloned from Bugzilla #1118404 comment 1 +

Actions #2

Updated by bmbouter about 9 years ago

  • Severity changed from Low to 1. Low
Actions #4

Updated by bmbouter about 9 years ago

The recommended workaround for this is two-fold. First, avoid getting into this situation by not leaving qpidd unreachable from pulp_celerybeat for hours or days. Second, if you do experience the issue, you can recover by restarting pulp_celerybeat. pulp_celerybeat should be perfectly safe to restart at any time.

Actions #5

Updated by bmbouter about 9 years ago

  • Severity changed from 1. Low to 3. High

Raising the severity so this will be looked into sooner.

Actions #6

Updated by jortel@redhat.com almost 9 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to jortel@redhat.com
Actions #7

Updated by jortel@redhat.com almost 9 years ago

Unable to reproduce with qpid.messaging in isolation. The process is leaking file descriptors opened on pipes (not sockets).
The Transport in kombu.transport.qpid opens a pipe in its constructor which (as best I can tell) is never closed.

Still investigating.
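
For illustration only, here is a minimal Python sketch of the suspected pattern, using hypothetical names rather than the actual kombu.transport.qpid source: a transport that creates a pipe in its constructor has to close it on teardown, otherwise every reconnect that builds a new transport leaks two descriptors.

import os

class QpidLikeTransport(object):
    """Sketch of the suspected leak; names and structure are illustrative."""

    def __init__(self):
        # A pipe created at construction time, e.g. to wake an event loop.
        # If nothing ever closes it, each new transport instance created
        # during reconnect attempts leaks two file descriptors.
        self.r, self.w = os.pipe()

    def close(self):
        # The fix: release the pipe descriptors when the transport is torn
        # down, so repeated reconnects do not accumulate open fds.
        for fd in (self.r, self.w):
            try:
                os.close(fd)
            except OSError:
                pass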

Added by bmbouter almost 9 years ago

Revision a99b9955 | View on GitHub

Fixes a file descriptor leak in python-kombu

This is a fix for upstream issue 476. https://github.com/celery/kombu/issues/476

fixes #470 https://pulp.plan.io/issues/470

Actions #9

Updated by bmbouter almost 9 years ago

  • Status changed from ASSIGNED to POST
  • Platform Release set to 2.6.2
Actions #10

Updated by bmbouter almost 9 years ago

QE, do the following to reproduce:

0) Install python-kombu 3.0.24-7.pulp (this should come with pulp 2.6.2)
1) Leave all pulp_* and httpd services off
2) ensure qpidd and mongodb are running
3) start pulp_celerybeat and verify in the log that it starts ok
4) stop qpidd
5) observe the log statements that show pulp_celerybeat is trying to reconnect
6) Determine the process id of pulp_celerybeat, say it's 12345
7) Use lsof to count the number of pipes by running:

sudo lsof -p 12345 | grep pipe | wc

8) Let pulp_celerybeat attempt to reconnect for 5-10 minutes. pulp_celerybeat should not have been restarted since step 3.
9) Use lsof again after the 5-10 minutes and verify that the first number is the same (a scripted version of this check is sketched below). It was 6 on my machine when I did this.
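
As a convenience, the lsof check above can also be scripted. The following sketch counts pipe descriptors via /proc and verifies the count stays flat while pulp_celerybeat keeps retrying; the pid is a placeholder, and it needs the same privileges as the sudo lsof command.

import os
import time

PID = 12345  # placeholder: substitute the real pulp_celerybeat pid

def count_pipes(pid):
    # Equivalent to: sudo lsof -p <pid> | grep pipe | wc
    fd_dir = "/proc/%d/fd" % pid
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd was closed between listdir and readlink
        if target.startswith("pipe:"):
            count += 1
    return count

baseline = count_pipes(PID)
time.sleep(600)  # let reconnect attempts run for roughly ten minutes
after = count_pipes(PID)
print("pipes before: %d, after: %d" % (baseline, after))
assert after == baseline, "pipe fd count grew; the leak is still present"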

Actions #11

Updated by bmbouter almost 9 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100
Actions #12

Updated by dkliban@redhat.com almost 9 years ago

  • Status changed from MODIFIED to 5
Actions #14

Updated by pthomas@redhat.com almost 9 years ago

Verified

[root@mgmt12 ~]# rpm -qa pulp-server
pulp-server-2.6.2-0.2.beta.el6.noarch
[root@mgmt12 ~]# 

[root@mgmt12 ~]# service qpidd stop
Stopping Qpid AMQP daemon: [  OK  ]
[root@mgmt12 ~]# 
[root@mgmt12 ~]# 
[root@mgmt12 ~]# service pulp_celerybeat status
celery init v10.0.
Using configuration: /etc/default/pulp_workers, /etc/default/pulp_celerybeat
pulp_celerybeat (pid 818) is running.
[root@mgmt12 ~]# sudo lsof -p 818 | grep pipe | wc
      6      54     432
[root@mgmt12 ~]# 
[root@mgmt12 ~]# 
[root@mgmt12 ~]# sudo lsof -p 818 | grep pipe | wc
      6      54     432
[root@mgmt12 ~]# 
Actions #15

Updated by pthomas@redhat.com almost 9 years ago

  • Status changed from 5 to 6
Actions #16

Updated by dkliban@redhat.com over 8 years ago

  • Status changed from 6 to CLOSED - CURRENTRELEASE
Actions #18

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added
