Issue #3129: occasional httpd segfault - Pulp

Added by cduryee over 6 years ago. Updated about 5 years ago.

Status:

CLOSED - CURRENTRELEASE

Priority:

High

Assignee:

bmbouter

Category:

Sprint/Milestone:

Start date:

Due date:

Estimated time:

Severity:

3. High

Version:

Platform Release:

2.15.2

OS:

Triaged:

Yes

Groomed:

Sprint Candidate:

Tags:

Pulp 2

Sprint:

Sprint 30

Quarter:

Description

NOTE: this fix requires an updated version of gofer. There isn't an associated Pulp commit.¶

Occasionally, the httpd instance that the Pulp WSGI app runs on will segfault. This causes httpd to do a graceful restart, and that restart can affect other applications served by the same httpd instance.

I have observed this on at least four different machines:

[Sun Nov  5 03:28:12 2017] httpd[27508]: segfault at 8 ip 00007f6ef3d7aa90 sp 00007f6ed6df3d70 error 4 in libpython2.7.so.1.0[7f6ef3c7b000+17d000]
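
For reference, the fields in a kernel segfault line like the one above can be decoded mechanically. This is only an illustrative sketch: the `parse_segfault` helper is hypothetical and not part of Pulp or any tool mentioned in this issue. It pulls out the faulting address, computes the instruction offset inside the mapped library (the value to look up against debug symbols), and splits the error code into the standard x86 page-fault flag bits.

```python
import re

# Matches the kernel's x86 segfault report, for example:
# "httpd[27508]: segfault at 8 ip ... sp ... error 4 in libpython2.7.so.1.0[7f6ef3c7b000+17d000]"
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Decode one kernel segfault log line; return None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"))
    return {
        "process": m.group("proc"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "library": m.group("lib"),
        # Offset of the faulting instruction relative to the library mapping.
        "offset_in_library": ip - base,
        # x86 page-fault error code bits.
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }
```

For the line above this gives a fault address of 0x8 and an offset of 0xffa90 into libpython2.7.so.1.0. Error 4 decodes to a user-mode read of a page that is not mapped, which is consistent with dereferencing a NULL or near-NULL pointer.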

Unfortunately I do not know how to reproduce this issue.

Files

t_a_a_py_bt.txt (23.6 KB) t_a_a_py_bt.txt	py-bt from all threads	bmbouter, 11/15/2017 12:34 AM
t_a_a_bt.txt (67.7 KB) t_a_a_bt.txt	bt from all threads	bmbouter, 11/15/2017 12:34 AM
gunicorn_working_with_pulp.diff (3.16 KB) gunicorn_working_with_pulp.diff		bmbouter, 12/07/2017 09:49 PM

Updated by ipanova@redhat.com over 6 years ago

Please provide the version of Pulp on which this happens.
This looks like a dup of https://pulp.plan.io/issues/2124

Updated by bmbouter over 6 years ago

Description updated (diff)

Updated by bmbouter over 6 years ago

File t_a_a_bt.txt t_a_a_bt.txt added
File t_a_a_py_bt.txt t_a_a_py_bt.txt added

I was given a coredump from a machine that experienced the segfault. Attached are the gdb "t a a bt" (thread apply all bt) output as t_a_a_bt.txt and the "t a a py-bt" output as t_a_a_py_bt.txt.

Updated by bmbouter over 6 years ago

Here are some of the things I see in the attached backtraces. I think the root cause is a "double free" problem in a destructor.

C callstack analysis¶

The call stack shows that the WSGI application is calling into goferd code, which calls into qpid.messaging and then into Django. The C stack trace shows that wait() in qpid/compat.py, line 127, is calling _remove_receiver() in django/dispatch/dispatcher.py, line 282, as a "normal" function call. It's "normal" in the sense that there is the typical C stack frame of PyEval_EvalFrameEx and then call_function. This suggests that whatever qpid calls on that line was somehow replaced with a reference to Django's _remove_receiver().

While Django is running, the segfault occurs in _PyTrash_thread_destroy_chain. This (the very last) stack frame, the one that segfaults, is not "normal" Python code execution in the sense that it's not running "user code" that I can see. _PyTrash_thread_destroy_chain is not widely documented, but it looks like a Python internal. Because the failure happens inside the interpreter itself rather than in Python code, you get a segfault, not an AttributeError or some other language-level Python exception.

_PyTrash_thread_destroy_chain has had issues in the past which caused segfaults (https://bugs.python.org/issue13992). That issue was fixed, but its gdb output is very similar to ours, especially at frame 0. The reporters described a "double free" problem where the destructor is called twice: the second time the interpreter goes to destroy the object, it's no longer there.
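
For background, as far as I understand CPython's internals, the "trashcan" is the mechanism that keeps deallocation of deeply nested containers from overflowing the C stack: once nesting gets too deep, objects are queued and freed later by the destroy-chain routine. The sketch below only illustrates that normal, non-crashing behavior. It is not a reproducer for this segfault, and the `build_nested` and `free_nested` helpers and the depth value are made up for the example.

```python
def build_nested(depth):
    """Return a list nested `depth` levels deep: [[[...[]...]]]."""
    obj = []
    for _ in range(depth):
        obj = [obj]
    return obj


def free_nested(depth=200000):
    """Build a deeply nested list and drop the only reference to it."""
    nested = build_nested(depth)
    # Dropping the last reference frees every level. A purely recursive
    # deallocation would need one C stack frame per level; the trashcan
    # defers part of the work so the stack stays bounded.
    del nested
    return True
```

If this completes without crashing, the deferred-deallocation path has done its job; the segfaults in this issue happen when that same path runs in an unexpected state.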

Celery, Kombu, and the Qpid transport for Kombu are not involved. They aren't shown anywhere on the call stack of the segfaulted thread. It's possible threads are sharing state with the crashing thread in a way that doesn't preclude their involvement, but there is no evidence suggesting that currently.

I heard that this has been occurring for a long time, so it's not a new regression. Users back to 2.8 have confirmed seeing the segfault logs.

A theory about severity¶

Each webserver process in the process group is itself multi-threaded. A segfault crashes the process (and all of its threads) but does not affect the other webserver processes, thanks to process isolation. Under low and medium load, the chance that more than one thread in a given process is handling work is lower than under high load. The theory is that a segfaulting process in a low or medium load environment can easily go unnoticed because of the limited amount of data and operations affected. In a high-load environment, any operations running in other threads of that process at the moment of the crash will also be affected.

Next steps¶

I think these are some key questions to try to answer with more evidence:

Does the coredump stacktrace involve the goferd -> qpid.messaging -> Django call stack with each crash?
How is qpid code in compat.py getting a reference to a Django function?
What exactly is the _PyTrash_thread_destroy_chain call and is this a bug in Python?
Who is starting all of the threads? Is it httpd or are some from qpid.messaging?
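
One cheap way to approach the last question is to list the live threads from inside the running process. This is a minimal sketch using only the standard library; the `describe_threads` helper is hypothetical, and where to call it from (for example a debug-only code path in the WSGI app) is left open.

```python
import threading


def describe_threads():
    """Return (name, is_daemon) for every live thread the threading module knows about."""
    return [(t.name, t.daemon) for t in threading.enumerate()]
```

Threads created outside Python (such as httpd's own worker threads) should only show up here once they have run Python code that touched the threading module, so a thread missing from this list is not proof that it does not exist. The gdb "t a a bt" output remains the authoritative view.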

Updated by dalley over 6 years ago

Priority changed from Normal to High
Severity changed from 2. Medium to 3. High
Triaged changed from No to Yes

Updated by rchan over 6 years ago

Sprint/Milestone set to 48

Updated by bmbouter over 6 years ago

On my upstream master checkout I am not yet able to reproduce it. Here's what I did:

1. Start w/ a fresh developer install and sanity test it w/ the zoo repo

2. Configure the webserver to emit coredumps on segfault https://httpd.apache.org/dev/debugging.html#crashes

3. Run 1000 syncs in a loop, and at the same time run several thousand consumer operations

In term 1:

for i in {1..1000}; do pulp-admin rpm repo sync run --repo-id zoo --force; done

In term 2:

[vagrant@pulp2 devel]$ cat reproducer.sh 
#!/bin/bash

sudo pulp-consumer -u admin -p admin register --consumer-id c1
pulp-admin rpm consumer bind --consumer-id c1 --repo-id zoo
pulp-admin rpm consumer package update run --consumer-id c1
pulp-admin rpm consumer package install run --consumer-id c1 --name tiger
pulp-admin rpm consumer package uninstall run --consumer-id c1 --name tiger
pulp-admin rpm consumer package uninstall run --consumer-id c1 --name tiger-types
pulp-admin rpm consumer package uninstall run --consumer-id c1 --name tiger-types-javadoc
pulp-admin rpm consumer unbind --consumer-id c1 --repo-id zoo
sudo pulp-consumer -u admin -p admin unregister

[vagrant@pulp2 devel]$ for i in {1..1000}; do ./reproducer.sh; done

4. Monitor the logs for segfault notices during that test:

[vagrant@pulp2 devel]$ sudo journalctl -f -l | grep fault

#10

Updated by bmbouter over 6 years ago

Thanks to @daviddavis for a suggestion. While doing all of the above, I'm now also reloading httpd 1000 times:

for i in {1..1000}; do sudo systemctl reload httpd; sleep 10; done

#11

Updated by bmbouter over 6 years ago

When I switched it to force-reload as in:

for i in {1..1000}; do sudo systemctl force-reload httpd; sleep 10; done

After some time it did reproduce, and I saw this in the logs.

Nov 30 21:19:51 pulp2.dev kernel: httpd[22343]: segfault at b8 ip 00007f196c4f7bd0 sp 00007f1951ea92b0 error 4 in libpython2.7.so.1.0[7f196c3ae000+1e1000]

This is with Django 1.9.13, while the OP environment was Django 1.6.11.

I didn't have coredumps configured correctly at the system level, so I'm rerunning it.

#14

Updated by bmbouter over 6 years ago

On my upstream reproducer it says it segfaulted in the same _PyTrash_thread_destroy_chain Python code:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fefe9be7740 in _PyTrash_thread_destroy_chain () from /usr/lib64/libpython2.7.so.1.0

That thread gives a py-bt output that once again includes gofer, qpid, and django's _remove_receiver() method. Here's the corresponding callstack:

Thread 1 (Thread 0x7fefcf653700 (LWP 8128)):
Traceback (most recent call first):
  File "/usr/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 294, in _remove_receiver
    self._dead_receivers = True
  File "/usr/lib/python2.7/site-packages/qpid/compat.py", line 127, in wait
    ready, _, _ = select([self], [], [], timeout)
  File "/usr/lib/python2.7/site-packages/qpid/concurrency.py", line 96, in wait
    sw.wait(timeout)
  File "/usr/lib/python2.7/site-packages/qpid/concurrency.py", line 59, in wait
    self.condition.wait(timeout - passed)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 252, in _wait
    return self._waiter.wait(predicate, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 273, in _ewait
    result = self._wait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 637, in _ewait
    result = self.connection._ewait(lambda: self.error or predicate(), timeout)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 730, in _get
    timeout):
  File "<string>", line 6, in _get
    (in an eval block)
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 1152, in fetch
    msg = self.session._get(self, timeout=timeout)
  File "<string>", line 6, in fetch
    (in an eval block)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/qpid/consumer.py", line 116, in get
    impl = self.receiver.fetch(timeout or NO_DELAY)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/qpid/reliability.py", line 36, in _fn
    return fn(thing, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 620, in get
    return self._impl.get(timeout)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 39, in _fn
    return fn(*args, **keywords)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 654, in next
    message = self.get(timeout)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 39, in _fn
    return fn(*args, **keywords)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 93, in read
    message, document = reader.next(wait)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 61, in run
    self.read()
  File "/usr/lib/python2.7/site-packages/gofer/common.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/lib64/python2.7/threading.py", line 804, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 777, in __bootstrap
    self.__bootstrap_inner()

Notice that the downstream reproducer was using Django 1.6.11 and the upstream reproducer uses Django 1.9.13. The _remove_receiver() method changed pretty significantly between the 1.6.11 implementation and the 1.9.13 implementation. That gives us a useful piece of information: the more complex 1.6.11 _remove_receiver() code wasn't the problem.

#15

Updated by bmbouter over 6 years ago

@jortel, can you explain the threading model with this part of gofer's code? Is the thread with the callstack in the above comment a "background" thread?

#16

Updated by bmbouter over 6 years ago

I configured the reproducer system to use rabbitMQ, and it also produced a segfault on the first try:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f968d280740 in _PyTrash_thread_destroy_chain () from /usr/lib64/libpython2.7.so.1.0

Thread 1 shows the same thing, but now not involving Django and only goferd:

Thread 1 (Thread 0x7f967390e700 (LWP 21764)):
Traceback (most recent call first):
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/amqp/consumer.py", line 162, in _wait
    if epoll.poll(timeout):
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/amqp/consumer.py", line 219, in fetch
    self._wait(fd, channel, timeout)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/amqp/consumer.py", line 110, in get
    impl = self.receiver.fetch(timeout or NO_DELAY)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/amqp/reliability.py", line 35, in _fn
    return fn(messenger, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 620, in get
    return self._impl.get(timeout)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 39, in _fn
    return fn(*args, **keywords)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 654, in next
    message = self.get(timeout)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/adapter/model.py", line 39, in _fn
    return fn(*args, **keywords)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 93, in read
    message, document = reader.next(wait)
  File "/usr/lib/python2.7/site-packages/gofer/messaging/consumer.py", line 61, in run
    self.read()
  File "/usr/lib/python2.7/site-packages/gofer/common.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/lib64/python2.7/threading.py", line 804, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 777, in __bootstrap
    self.__bootstrap_inner()

Thread 1 is the segfaulting thread, which I confirmed with the t a a bt output which in Thread 1 shows this at the top of the callstack:

Thread 1 (Thread 0x7f967390e700 (LWP 21764)):
#0  0x00007f968d280740 in _PyTrash_thread_destroy_chain () from /usr/lib64/libpython2.7.so.1.0
#1  0x00007f968d32ba93 in call_function (oparg=<optimized out>, pp_stack=0x7f967390c098) at /usr/src/debug/Python-2.7.13/Python/ceval.c:4431
#2  PyEval_EvalFrameEx (f=f@entry=Frame 0x7f96700c0bc0, for file /usr/lib/python2.7/site-packages/gofer/messaging/adapter/amqp/consumer.py, line 162, in _wait (), 
    throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.13/Python/ceval.c:3063
#3  0x00007f968d32bbbe in fast_function (nk=0, na=<optimized out>, n=<optimized out>, pp_stack=0x7f967390c1d8, func=<optimized out>)
    at /usr/src/debug/Python-2.7.13/Python/ceval.c:4514
#4  call_function (oparg=<optimized out>, pp_stack=0x7f967390c1d8) at /usr/src/debug/Python-2.7.13/Python/ceval.c:4449
Python Exception <class 'gdb.error'> There is no member named ob_ival.: 
#5  PyEval_EvalFrameEx (f=f@entry=, throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.13/Python/ceval.c:3063
#6  0x00007f968d32f5ec in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, 
    kws=0x7f9673925f10, kwcount=0, defs=0x7f9673973ba8, defcount=1, closure=0x0) at /usr/src/debug/Python-2.7.13/Python/ceval.c:3661
#7  0x00007f968d32bb1c in fast_function (nk=0, na=<optimized out>, n=<optimized out>, pp_stack=0x7f967390c3d8, func=<optimized out>)
    at /usr/src/debug/Python-2.7.13/Python/ceval.c:4524

#17

Updated by jortel@redhat.com over 6 years ago

bmbouter wrote:

@jortel, can you explain the threading model with this part of gofer's code? Is the thread with the callstack in the above comment a "background" thread?

The Consumer IsA Thread and provides a consistent asynchronous dispatching model for all of the messaging libs. It is really just a thread that reads a queue and invokes Consumer.dispatch() with each message.
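
The model described above can be sketched roughly as follows. This is not gofer's code: `SketchConsumer`, its `source` queue, the `handled` list, and the polling timeout are all made up for illustration, and running it as a daemon thread is an assumption on my part. It is written for Python 3.

```python
import queue
import threading


class SketchConsumer(threading.Thread):
    """Thread that reads messages from a queue and dispatches each one."""

    def __init__(self, source):
        super().__init__(name="sketch-consumer")
        self.daemon = True  # assumption: run as a daemon thread
        self.source = source
        self.handled = []
        self._aborted = threading.Event()

    def abort(self):
        """Ask the read loop to stop after its current iteration."""
        self._aborted.set()

    def run(self):
        # Poll with a short timeout so an abort request is noticed promptly.
        while not self._aborted.is_set():
            try:
                message = self.source.get(timeout=0.1)
            except queue.Empty:
                continue
            self.dispatch(message)

    def dispatch(self, message):
        """Stand-in for the real handler: just record the message."""
        self.handled.append(message)
```

The relevant property for this issue is that the thread sits blocked in a wait call most of the time, which matches where both backtraces above show the crashing thread (select() for qpid, epoll.poll() for amqp).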

#18

Updated by bmbouter over 6 years ago

File gunicorn_working_with_pulp.diff gunicorn_working_with_pulp.diff added

In #python I was asking about this, and they suggested that the issue could be in mod_wsgi itself. They suggested that I try to reproduce outside of mod_wsgi by running Pulp under Gunicorn and having Apache provide a reverse proxy. I configured Pulp this way, and I had to disable authorization because that is welded to how apache handles the WSGI interface differently from Gunicorn. Specifically I applied the attached patch to get it going. The diff also disables mod_wsgi and configures Apache to use mod_proxy instead.

I run Gunicorn in the Pulp2 dev w/ the attached diff using:

workon pulp
pip install gunicorn
cd ~/devel/
gunicorn -w 4 --env DJANGO_SETTINGS_MODULE=pulp.server.webservices.settings gunicorn_pulp:application

Note also that in the same directory I made a file called gunicorn_pulp.py:

[vagrant@pulp2 devel]$ cat gunicorn_pulp.py
from pulp.server.webservices.application import wsgi_application

application = wsgi_application()

It's force-reloading Apache 1000 times now.

#19

Updated by bmbouter over 6 years ago

Also I learned more about reload and force-reload. Using strace I traced httpd when receiving both commands and both sent httpd a SIGUSR1 signal to cause the reload.

I also wanted to cause Gunicorn to reload continuously so I am running this concurrently also:

for i in {1..1000}; do kill -s SIGHUP $(ps -awfux | grep gunicorn | grep S\+ | grep pulp\.server | awk '{print $2}'); sleep 15; done

I should receive a coredump if it segfaults while reloading with a HUP. HUP is definitely the right signal for a Gunicorn reload, per its docs.

Note that when sending the HUP, the gunicorn output confirms it's being reloaded with output like:

[2017-12-07 21:11:17 +0000] [24512] [INFO] Handling signal: hup
[2017-12-07 21:11:17 +0000] [24512] [INFO] Hang up: Master
[2017-12-07 21:11:17 +0000] [27550] [INFO] Booting worker with pid: 27550
[2017-12-07 21:11:17 +0000] [27551] [INFO] Booting worker with pid: 27551
[2017-12-07 21:11:17 +0000] [27226] [INFO] Worker exiting (pid: 27226)
[2017-12-07 21:11:17 +0000] [27223] [INFO] Worker exiting (pid: 27223)
[2017-12-07 21:11:17 +0000] [27224] [INFO] Worker exiting (pid: 27224)
[2017-12-07 21:11:17 +0000] [27225] [INFO] Worker exiting (pid: 27225)
[2017-12-07 21:11:17 +0000] [27553] [INFO] Booting worker with pid: 27553
[2017-12-07 21:11:17 +0000] [27552] [INFO] Booting worker with pid: 27552
[2017-12-07 21:11:32 +0000] [24512] [INFO] Handling signal: hup
[2017-12-07 21:11:32 +0000] [24512] [INFO] Hang up: Master
[2017-12-07 21:11:32 +0000] [27864] [INFO] Booting worker with pid: 27864
[2017-12-07 21:11:32 +0000] [27866] [INFO] Booting worker with pid: 27866
[2017-12-07 21:11:32 +0000] [27865] [INFO] Booting worker with pid: 27865
[2017-12-07 21:11:32 +0000] [27551] [INFO] Worker exiting (pid: 27551)
[2017-12-07 21:11:32 +0000] [27553] [INFO] Worker exiting (pid: 27553)
[2017-12-07 21:11:32 +0000] [27550] [INFO] Worker exiting (pid: 27550)
[2017-12-07 21:11:32 +0000] [27552] [INFO] Worker exiting (pid: 27552)
[2017-12-07 21:11:32 +0000] [27867] [INFO] Booting worker with pid: 27867
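
For anyone who wants to script the reload without the ps/grep/awk pipeline, the same signal can be sent from Python. The sketch below is self-contained and does not involve Gunicorn at all: it starts a throwaway child process that installs a SIGHUP handler, sends it the signal, and waits for the child to acknowledge. The `CHILD` script and `send_hup_and_wait` helper are hypothetical names for this example; with a real server, the master PID would be passed to `os.kill(pid, signal.SIGHUP)` instead. It is written for Python 3 on a POSIX system.

```python
import signal
import subprocess
import sys

# Stand-in for a server process: reports when its handler is installed,
# then waits until it has received SIGHUP.
CHILD = """
import signal, sys, time
got_hup = []
signal.signal(signal.SIGHUP, lambda signum, frame: got_hup.append(signum))
print("ready"); sys.stdout.flush()
while not got_hup:
    time.sleep(0.05)
print("reloaded"); sys.stdout.flush()
"""


def send_hup_and_wait():
    """Start the child, send it SIGHUP, and return what it printed."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    # Block until the child has installed its handler; signalling earlier
    # would hit the default SIGHUP action and terminate it.
    ready = proc.stdout.readline().strip()
    proc.send_signal(signal.SIGHUP)
    reply = proc.stdout.readline().strip()
    proc.stdout.close()
    proc.wait()
    return ready, reply
```

Waiting for the "ready" line before signalling matters for the same reason the shell loop sleeps between iterations: a process that has not yet set up its handlers will simply die on HUP rather than reload.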

#20

Updated by dkliban@redhat.com over 6 years ago

Here is a potentially related bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1445540

#21

Updated by bmbouter over 6 years ago

In continuously reloading httpd with mod_proxy and gunicorn and dispatching sync and consumer action work into Pulp, I could not reproduce any segfaults. This suggests that mod_wsgi is somehow related.

I created a separate reproducer which waits in an epoll loop and sent some sighups to it. This is an attempt to isolate the bug inside of mod_wsgi and outside of gofer. It did not reproduce which suggests that this reproducer (below) is not the right set of conditions, or the issue is somehow gofer related.

I posted an issue against mod_wsgi upstream here: https://github.com/GrahamDumpleton/mod_wsgi/issues/250

import socket, select

EOL1 = b'\n\n'
EOL2 = b'\n\r\n'
response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
response += b'Hello, world!'

def application(environ, start_response):
    serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    serversocket.bind(('0.0.0.0', 8080))
    serversocket.listen(1)
    serversocket.setblocking(0)

    data = 'Hello, World!\n'
    status = '200 OK'
    response_headers = [
        ('Content-type','text/plain'),
        ('Content-Length', str(len(data)))
    ]
    start_response(status, response_headers)

    epoll = select.epoll()
    epoll.register(serversocket.fileno(), select.EPOLLIN)

    try:
       connections = {}; requests = {}; responses = {}
       while True:
          events = epoll.poll(1)
          for fileno, event in events:
             if fileno == serversocket.fileno():
                connection, address = serversocket.accept()
                connection.setblocking(0)
                epoll.register(connection.fileno(), select.EPOLLIN)
                connections[connection.fileno()] = connection
                requests[connection.fileno()] = b''
                responses[connection.fileno()] = response
             elif event & select.EPOLLIN:
                requests[fileno] += connections[fileno].recv(1024)
                if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
                   epoll.modify(fileno, select.EPOLLOUT)
                   print('-'*40 + '\n' + requests[fileno].decode()[:-2])
             elif event & select.EPOLLOUT:
                byteswritten = connections[fileno].send(responses[fileno])
                responses[fileno] = responses[fileno][byteswritten:]
                if len(responses[fileno]) == 0:
                   epoll.modify(fileno, 0)
                   connections[fileno].shutdown(socket.SHUT_RDWR)
             elif event & select.EPOLLHUP:
                epoll.unregister(fileno)
                connections[fileno].close()
                del connections[fileno]
    finally:
       epoll.unregister(serversocket.fileno())
       epoll.close()

    serversocket.close()
    return iter([data])

#22

Updated by bmbouter over 6 years ago

Status changed from NEW to ASSIGNED
Assignee set to bmbouter

#23

Updated by bmbouter over 6 years ago

Based on input from the upstream mod_wsgi developer, the recommendation is to add an atexit handler that stops goferd's daemon threads. @jortel confirmed that goferd does spawn daemon threads. @jortel will make a patch using atexit handlers, and I can test it. I'm also going to ask the upstream mod_wsgi developer if he can show us the Python bug where multithreading causes the double free in Python 2.7.

@dkliban mentioned in our discussion that perhaps we are experiencing this bug: https://bugs.python.org/issue19466

#25

Updated by bmbouter over 6 years ago

I had to rebuild my reproducer environment yet again, so I once again reproduced it. I wanted to check, across many crashes, whether all of them would be addressed by this fix, so that if the fix works we know it's the whole fix. I found that it covers almost all of them (16 of the 17 inspected, see below), but mongo may also need a fix. There is no fix needed for celery/kombu since 0 segfaults involved them. That makes sense, since the webserver only publishes tasks, and the thread should only be created when asynchronously reading messages. The webserver never does that.

16 of the 17 segfault coredumps inspected involved the goferd daemon thread. 14 of those 16 exited inside _PyTrash_thread_destroy_chain, and the two others exited at a line like:

#0  0x00007f4dac9a737e in PyEval_EvalFrameEx (f=f@entry=0x7f4da4659230, throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.13/Python/ceval.c:3401
3401        if (tstate->frame->f_exc_type != NULL)

There was one segfault that occurred in the mongo daemon thread, so that is also a possible issue. In my test run it occurred 1 in 17 times. We could submit an atexit patch upstream to pymongo, or at least file an issue with them. I saved the coredump where their code segfaults.

#26

Updated by bmbouter over 6 years ago

Here is a gofer diff that introduces an atexit handler that I'm testing:

diff --git a/common.py b/common.py
index 8e6014e..075caf4 100644
--- a/common.py
+++ b/common.py
@@ -13,6 +13,7 @@
 # Jeff Ortel <jortel@redhat.com>
 #

+import atexit
 import os
 import inspect
 import errno
@@ -169,6 +170,16 @@ class Thread(_Thread):
             log.info('thread:%s, ABORTED', thread.getName())
         return aborted

+    def start(self):
+        """
+        Start the thread.
+        """
+        def handler():
+            self.abort()
+            self.join()
+        atexit.register(handler)
+        super(Thread, self).start()
+
     def abort(self):
         """
         Abort event raised.
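
The pattern in the diff above can be exercised in isolation. The sketch below is not gofer's code: `StoppableThread`, its `stop_and_join` method, and the trivial work loop are stand-ins written for Python 3. It shows the same idea, which is to register an atexit handler when the thread starts so that interpreter shutdown first asks the daemon thread to stop and then waits for it.

```python
import atexit
import threading


class StoppableThread(threading.Thread):
    """Daemon thread that is asked to stop, and joined, at interpreter exit."""

    def __init__(self):
        super().__init__()
        self.daemon = True
        self._aborted = threading.Event()

    def start(self):
        # Register before starting so the handler exists for the thread's whole life.
        atexit.register(self.stop_and_join)
        super().start()

    def abort(self):
        """Signal the work loop to exit."""
        self._aborted.set()

    def stop_and_join(self):
        """Stop the thread and wait for it; safe to call more than once."""
        self.abort()
        self.join()

    def run(self):
        # Stand-in for the real work loop: wake periodically and check the abort flag.
        while not self._aborted.wait(0.05):
            pass
```

The point of joining in the handler is that the thread is guaranteed to have left the interpreter before teardown continues, rather than still being blocked in a wait call while its thread state is destroyed underneath it.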

#27

Updated by bmbouter over 6 years ago

After running a thousand restarts only 1 coredump was produced. Upon inspection it was not a segfault, but instead a SIGABRT being handled in a way that generates a coredump. I think this is entirely unrelated. For comparison, 1 hour without the fix generates 17+ coredumps, and 12 hours of testing with the fix produced only 1.

I'm opening a patch against upstream gofer. I'm not sure how that merged patch needs to be included to be part of a release.

#28

Updated by bmbouter over 6 years ago

Status changed from ASSIGNED to POST

PR posted here for upstream gofer: https://github.com/jortel/gofer/pull/78

#29

Updated by bmbouter over 6 years ago

I don't plan to send a patch to PyMongo based on the 1 coredump I observed. I did report it to them however with all the details: https://jira.mongodb.org/browse/PYTHON-1442

#30