Issue #431: A key-error for connection_errors Stack dump appears when stopping pulp workers from time to time

Actions

Send by e-mail Copy link

Issue #431

closed

A key-error for connection_errors Stack dump appears when stopping pulp workers from time to time

Added by mkovacik@redhat.com about 9 years ago. Updated about 5 years ago.

Status:

CLOSED - WONTFIX

Priority:

Low

Assignee:

Category:

Sprint/Milestone:

Start date:

Due date:

Estimated time:

Severity:

2. Medium

Version:

2.4 Beta

Platform Release:

OS:

Triaged:

Yes

Groomed:

Sprint Candidate:

Tags:

Pulp 2

Sprint:

Quarter:

Description

Description of problem:
The stack dump from "Additional Info" can be seen in logs when starting/stopping pulp workers

Version-Release number of selected component (if applicable):
pulp-server-2.4.0-0.14.beta.fc20.noarch
python-kombu-3.0.15-5.pulp.fc20.noarch

How reproducible:
~1/2

Actual results:
repeating few times, the stack dump from additional info appears

Expected results:
no stack dump in during stop workers appears when accessing the connection_errors property

Additional info:

the stack dump
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping The Apache HTTP Server...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp's Celerybeat...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Resource Manager...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Worker #0...
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopping Pulp Celery Workers...
May 16 13:37:10 ec2-54-220-31-64 celery: celery beat v3.1.11 (Cipater) is starting.
May 16 13:37:10 ec2-54-220-31-64 celery: _ - ... _ -
May 16 13:37:10 ec2-54-220-31-64 celery: Configuration ->
May 16 13:37:10 ec2-54-220-31-64 celery: . broker -> qpid://guest@ec2-54-220-31-64.eu-west-1.compute.amazonaws.com:5672//
May 16 13:37:10 ec2-54-220-31-64 celery: . loader -> celery.loaders.app.AppLoader
May 16 13:37:10 ec2-54-220-31-64 celery: . scheduler -> pulp.server.async.scheduler.Scheduler
May 16 13:37:10 ec2-54-220-31-64 celery: . logfile -> [stderr]@%INFO
May 16 13:37:10 ec2-54-220-31-64 celery: . maxinterval -> now (0s)
May 16 13:37:10 ec2-54-220-31-64 celery: worker: Warm shutdown (MainProcess)
May 16 13:37:10 ec2-54-220-31-64 celery: worker: Warm shutdown (MainProcess)
May 16 13:37:10 ec2-54-220-31-64 celery: Exception in thread Thread-2 (most likely raised during interpreter shutdown):
May 16 13:37:10 ec2-54-220-31-64 celery: Traceback (most recent call last):
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib64/python2.7/threading.py", line 811, inbootstrap_inner
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib64/python2.7/threading.py", line 764, in run
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py", line 92, in monitor_events
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 732, in __exit_
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 334, in release
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 302, in _close
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 292, in _do_close_self
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 285, in maybe_close_channel
May 16 13:37:10 ec2-54-220-31-64 celery: File "/usr/lib/python2.7/site-packages/kombu/utils/_init__.py", line 320, inget_
May 16 13:37:10 ec2-54-220-31-64 celery: <type 'exceptions.KeyError'>: 'connection_errors'
May 16 13:37:10 ec2-54-220-31-64 systemd: Stopped Pulp's Celerybeat.
May 16 13:37:11 ec2-54-220-31-64 celery: reserved_resource_worker-0@ec2-54-220-31-64.eu-west-1.compute.amazonaws.com ready.
May 16 13:37:11 ec2-54-220-31-64 celery: resource_manager@ec2-54-220-31-64.eu-west-1.compute.amazonaws.com ready.
May 16 13:37:11 ec2-54-220-31-64 systemd: Stopped The Apache HTTP Server.
May 16 13:37:12 ec2-54-220-31-64 systemd: Stopped Pulp Worker #0.

+ This bug was cloned from Bugzilla Bug #1098536 +

Actions

Copy link

Updated by skarmark@redhat.com about 9 years ago

***** Bug 1098540 has been marked as a duplicate of this bug. ***

+ This comment was cloned from Bugzilla #1098536 comment 1 +

Actions

Copy link

Updated by bmbouter about 9 years ago

I believe I understand why this occurs, but I'm not sure of the best way to fix it

In the stacktrace you'll see this is related to the monitor_events[0] method, which is run inside of a thread that is spawned as part of celerybeat. This method uses a connection to the broker, which in this case is qpid. that connection uses kombu and qpid.messaging underneath.

The teardown order of the threads I believe is the cause of this exception. If the threads used by qpid.messaging exit first, then it would cause a connection error to be raised on the Receiver object used by the monitor_events thread, which may not yet have exited.

There is already a try/accept around the monitor_events() call[1] and that code is designed to log the exception as an error, sleep for 10 seconds, and then reenter monitor_events() again. Of course the thread will have exited long before that. The only downside here is the stack trace, but I want real connection errors captured also, so I'm cautious to silence this exception.

[0]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L152

[1]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L146

+ This comment was cloned from Bugzilla #1098536 comment 2 +

Actions

Copy link

Updated by bmbouter about 9 years ago

One thing we could do would be to reorder the sleep(10) and the logging statement here[0] causing the thread to delay logging until it almost certainly exited. Delayed logging isn't great, so maybe we could delay be 2 seconds, log, and then sleep for 8 seconds. 2 seconds should be enough time for all daemon thread to terminate.

[0]: https://github.com/pulp/pulp/blob/0acab9c455d9976f904c489553345e2c0ed57444/server/pulp/server/async/scheduler.py#L149

+ This comment was cloned from Bugzilla #1098536 comment 3 +

Actions

Copy link

Updated by bmbouter about 9 years ago

There is a better way to fix this. The EventMonitor Thread should be converted to be a non-daemon thread, and wait for a signal to shutdown properly. During thread exit, it should properly de-allocate the resources used by the thread.

See the note in the Threading library[0] that recommends this very fix for this type of problem.

[0]: https://docs.python.org/2/library/threading.html#thread-objects

+ This comment was cloned from Bugzilla #1098536 comment 4 +

Actions

Copy link

Updated by bmbouter about 9 years ago

Severity changed from Medium to 2. Medium

Actions

Copy link

Updated by bmbouter about 5 years ago

Status changed from NEW to CLOSED - WONTFIX

Actions

Copy link

Updated by bmbouter about 5 years ago

Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.

Actions

Copy link

Updated by bmbouter about 5 years ago

Tags Pulp 2 added

Actions

Send by e-mail Copy link

Also available in: Atom PDF

Project

Profile

Help

Pulp

Agile boards

Custom queries

Issue #431

A key-error for connection_errors Stack dump appears when stopping pulp workers from time to time

Updated by skarmark@redhat.com about 9 years ago

Updated by bmbouter about 9 years ago

Updated by bmbouter about 9 years ago

Updated by bmbouter about 9 years ago

Updated by bmbouter about 9 years ago

Updated by bmbouter about 5 years ago

Updated by bmbouter about 5 years ago

Updated by bmbouter about 5 years ago