Project

Profile

Help

Issue #431

A key-error for connection_errors Stack dump appears when stopping pulp workers from time to time

Added by mkovacik@redhat.com over 5 years ago. Updated over 1 year ago.

Status:
CLOSED - WONTFIX
Priority:
Low
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
2.4 Beta
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

Description of problem:
The stack dump from "Additional Info" can be seen in logs when starting/stopping pulp workers

Version-Release number of selected component (if applicable):
pulp-server-2.4.0-0.14.beta.fc20.noarch
python-kombu-3.0.15-5.pulp.fc20.noarch

How reproducible:
~1/2

Steps to Reproduce:
1. systemctl stop `systemctl list-unit-files | egrep 'pulp|httpd|goferd' | cut -d\ -f1
2. check /var/log messages for the stack dump
2. ystemctl start `systemctl list-unit-files | egrep 'pulp|httpd|goferd' | cut -d\ -f1
4. goto 1.

Actual results:
repeating few times, the stack dump from additional info appears

Expected results:
no stack dump in during stop workers appears when accessing the connection_errors property

Additional info:

  1. the stack dump
    May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping The Apache HTTP Server...
    May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp's Celerybeat...
    May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Resource Manager...
    May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Worker #0...
    May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Celery Workers...
    May 16 13:37:10 ec2-54-220-31-64 celery: celery beat v3.1.11 (Cipater) is starting.
    May 16 13:37:10 ec2-54-220-31-64 celery: _ - ... _ -
    May 16 13:37:10 ec2-54-220-31-64 celery: Configuration ->
    May 16 13:37:10 ec2-54-220-31-64 celery: . broker -> qpid://guest@ec2-54-220-31-64.eu-west-1.compute.amazonaws.com:5672//
    May 16 13:37:10 ec2-54-220-31-64 celery: . loader -> celery.loaders.app.AppLoader
    May 16 13:37:10 ec2-54-220-31-64 celery: . scheduler -> pulp.server.async.scheduler.Scheduler
    May 16 13:37:10 ec2-54-220-31-64 celery: . logfile -> [stderr]@%INFO
    May 16 13:37:10 ec2-54-220-31-64 celery: . maxinterval -> now (0s)
    May 16 13:37:10 ec2-54-220-31-64 celery: worker: Warm shutdown (MainProcess)
    May 16 13:37:10 ec2-54-220-31-64 celery: worker: Warm shutdown (MainProcess)
    May 16 13:37:10 ec2-54-220-31-64 celery: Exception in thread Thread-2 (most likely raised during interpreter shutdown):
    May 16 13:37:10 ec2-54-220-31-64 celery: Traceback (most recent call last):
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib64/python2.7/threading.py", line 811, inbootstrap_inner
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib64/python2.7/threading.py", line 764, in run
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py", line 92, in monitor_events
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 732, in __exit
    _
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 334, in release
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 302, in _close
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 292, in _do_close_self
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 285, in maybe_close_channel
    May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/utils/_init__.py", line 320, inget_
    May 16 13:37:10 ec2-54-220-31-64 celery: <type 'exceptions.KeyError'>: 'connection_errors'
    May 16 13:37:10 ec2-54-220-31-64 systemd: Stopped Pulp's Celerybeat.
    May 16 13:37:11 ec2-54-220-31-64 celery: ready.
    May 16 13:37:11 ec2-54-220-31-64 celery: ready.
    May 16 13:37:11 ec2-54-220-31-64 systemd: Stopped The Apache HTTP Server.
    May 16 13:37:12 ec2-54-220-31-64 systemd: Stopped Pulp Worker #0.

+ This bug was cloned from Bugzilla Bug #1098536 +

History

#1 Updated by skarmark@redhat.com over 5 years ago

***** Bug 1098540 has been marked as a duplicate of this bug. ***

+ This comment was cloned from Bugzilla #1098536 comment 1 +

#2 Updated by bmbouter over 5 years ago

I believe I understand why this occurs, but I'm not sure of the best way to fix it

In the stacktrace you'll see this is related to the monitor_events[0] method, which is run inside of a thread that is spawned as part of celerybeat. This method uses a connection to the broker, which in this case is qpid. that connection uses kombu and qpid.messaging underneath.

The teardown order of the threads I believe is the cause of this exception. If the threads used by qpid.messaging exit first, then it would cause a connection error to be raised on the Receiver object used by the monitor_events thread, which may not yet have exited.

There is already a try/accept around the monitor_events() call[1] and that code is designed to log the exception as an error, sleep for 10 seconds, and then reenter monitor_events() again. Of course the thread will have exited long before that. The only downside here is the stack trace, but I want real connection errors captured also, so I'm cautious to silence this exception.

[0]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L152

[1]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L146

+ This comment was cloned from Bugzilla #1098536 comment 2 +

#3 Updated by bmbouter over 5 years ago

One thing we could do would be to reorder the sleep(10) and the logging statement here[0] causing the thread to delay logging until it almost certainly exited. Delayed logging isn't great, so maybe we could delay be 2 seconds, log, and then sleep for 8 seconds. 2 seconds should be enough time for all daemon thread to terminate.

[0]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L149

+ This comment was cloned from Bugzilla #1098536 comment 3 +

#4 Updated by bmbouter over 5 years ago

There is a better way to fix this. The EventMonitor Thread should be converted to be a non-daemon thread, and wait for a signal to shutdown properly. During thread exit, it should properly de-allocate the resources used by the thread.

See the note in the Threading library[0] that recommends this very fix for this type of problem.

[0]: https://docs.python.org/2/library/threading.html#thread-objects

+ This comment was cloned from Bugzilla #1098536 comment 4 +

#5 Updated by bmbouter over 5 years ago

  • Severity changed from Medium to 2. Medium

#6 Updated by bmbouter over 1 year ago

  • Status changed from NEW to CLOSED - WONTFIX

#7 Updated by bmbouter over 1 year ago

Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.

#8 Updated by bmbouter over 1 year ago

  • Tags Pulp 2 added

Please register to edit this issue

Also available in: Atom PDF