Issue #2331

closed

The "unsafe_autoretry" feature does not retry with mongoengine.QuerySet codepaths

Added by mhrivnak about 8 years ago. Updated over 5 years ago.

Status:
CLOSED - WONTFIX
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
2.8.7
Platform Release:
OS:
CentOS 7
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 11
Quarter:

Description

If on a running pulp deployment you stop mongod, pulp_celerybeat will eventually die and produce the following log output, including a traceback.

Instead, it should handle the connection failure gracefully and recover when mongod becomes available again.

This behavior has been seen "in the wild" when a full disk caused mongod to die.

Oct 11 14:04:09 localhost.localdomain pulp[12792]: pulp.server.async.worker_watcher:DEBUG: 'worker-heartbeat' sent at time 2016-10-11 19:04:09.617800 from resource_manager@localhost.localdomain, received at time: 2016-10-11 18:04:09.635230
Oct 11 14:04:39 localhost.localdomain pulp[12792]: pulp.server.async.scheduler:ERROR: localhost:27017: [Errno 111] Connection refused
Oct 11 14:04:47 localhost.localdomain pulp[12792]: pulp.server.async.scheduler:DEBUG: Checking if pulp_workers, pulp_celerybeat, or pulp_resource_manager processes are missing for more than 300 seconds
Oct 11 14:04:49 localhost.localdomain pulp[12792]: kombu.transport.qpid:DEBUG: Attempting to connect to qpid with SASL mechanism ANONYMOUS
Oct 11 14:04:49 localhost.localdomain pulp[12792]: kombu.transport.qpid:INFO: Connected to qpid with SASL mechanism ANONYMOUS
Oct 11 14:04:49 localhost.localdomain pulp[12792]: pulp.server.async.scheduler:INFO: Event Monitor Starting
Oct 11 14:04:49 localhost.localdomain pulp[12792]: kombu.transport.qpid:DEBUG: Attempting to connect to qpid with SASL mechanism ANONYMOUS
Oct 11 14:04:49 localhost.localdomain pulp[12792]: kombu.transport.qpid:INFO: Connected to qpid with SASL mechanism ANONYMOUS
Oct 11 14:04:49 localhost.localdomain pulp[12792]: kombu.mixins:INFO: Connected to qpid://localhost:5672//
Oct 11 14:04:49 localhost.localdomain pulp[12792]: pulp.server.async.worker_watcher:DEBUG: 'worker-heartbeat' sent at time 2016-10-11 19:04:49.940497 from resource_manager@localhost.localdomain, received at time: 2016-10-11 18:04:49.961244
Oct 11 14:05:17 localhost.localdomain pulp[12792]: pulp.server.async.scheduler:ERROR: localhost:27017: [Errno 111] Connection refused
Oct 11 14:05:18 localhost.localdomain pulp[12792]: pulp.server.async.worker_watcher:DEBUG: 'scheduler-event' sent at time 2016-10-11 18:05:18.236002 from scheduler@localhost.localdomain, received at time: 2016-10-11 18:05:18.236004
Oct 11 14:05:20 localhost.localdomain pulp[12792]: pulp.server.async.scheduler:ERROR: localhost:27017: [Errno 111] Connection refused
Oct 11 14:05:30 localhost.localdomain pulp[12792]: kombu.transport.qpid:DEBUG: Attempting to connect to qpid with SASL mechanism ANONYMOUS
Oct 11 14:05:30 localhost.localdomain pulp[12792]: kombu.transport.qpid:INFO: Connected to qpid with SASL mechanism ANONYMOUS
Oct 11 14:05:30 localhost.localdomain pulp[12792]: pulp.server.async.scheduler:INFO: Event Monitor Starting
Oct 11 14:05:30 localhost.localdomain pulp[12792]: kombu.transport.qpid:DEBUG: Attempting to connect to qpid with SASL mechanism ANONYMOUS
Oct 11 14:05:30 localhost.localdomain pulp[12792]: kombu.transport.qpid:INFO: Connected to qpid with SASL mechanism ANONYMOUS
Oct 11 14:05:30 localhost.localdomain pulp[12792]: kombu.mixins:INFO: Connected to qpid://localhost:5672//
Oct 11 14:05:30 localhost.localdomain pulp[12792]: pulp.server.async.worker_watcher:DEBUG: 'worker-heartbeat' sent at time 2016-10-11 19:05:30.561665 from reserved_resource_worker-0@localhost.localdomain, received at time: 2016-10-11 18:05:30.579888
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504) beat raised exception <class 'pymongo.errors.ServerSelectionTimeoutError'>: ServerSelectionTimeoutError('localhost:27017: [Errno 111] Connection refused',)
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504) Traceback (most recent call last):
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib/python2.7/site-packages/celery/apps/beat.py", line 112, in start_scheduler
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     beat.start()
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib/python2.7/site-packages/celery/beat.py", line 462, in start
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     interval = self.scheduler.tick()
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py", line 252, in tick
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     worker_watcher.handle_worker_heartbeat(scheduler_event)
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib/python2.7/site-packages/pulp/server/async/worker_watcher.py", line 75, in handle_worker_heartbeat
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     worker = Worker.objects(name=event_info['worker_name']).first()
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 264, in first
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     result = queryset[0]
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 161, in __getitem__
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     return queryset._document._from_son(queryset._cursor[key],
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 565, in __getitem__
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     for doc in clone:
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 1097, in next
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     if len(self.__data) or self._refresh():
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 1019, in _refresh
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     self.__read_concern))
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib64/python2.7/site-packages/pymongo/cursor.py", line 850, in __send_message
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     **kwargs)
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib64/python2.7/site-packages/pymongo/mongo_client.py", line 777, in _send_message_with_response
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     server = topology.select_server(selector)
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib64/python2.7/site-packages/pymongo/topology.py", line 141, in select_server
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     address))
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)   File "/usr/lib64/python2.7/site-packages/pymongo/topology.py", line 117, in select_servers
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504)     self._error_message(selector))
Oct 11 14:05:48 localhost.localdomain pulp[12792]: celery.beat:CRITICAL: (12792-69504) ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused

Related issues

Related to Pulp - Issue #2423: As a user, I want pulp_celerybeat to survive a brief database outage without having to use the unsafe_autoretry feature. (CLOSED - WORKSFORME, assignee: daviddavis)
Related to Pulp - Task #2417: Ensure all processes have initial and reconnect support for PostgreSQL (CLOSED - CURRENTRELEASE, assignee: bmbouter)

Actions #2

Updated by mmccune@redhat.com about 8 years ago

  • Severity changed from 2. Medium to 3. High

I'd like to bump the severity to High because, from the user's perspective, the interaction between the two services isn't obvious. Generally we would recommend a full restart when Mongo goes down, but it would be much more resilient if the services could handle outages better.

Actions #3

Updated by amacdona@redhat.com about 8 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone set to 27
  • Severity changed from 3. High to 2. Medium
  • Triaged changed from No to Yes
Actions #4

Updated by jortel@redhat.com about 8 years ago

  • Sprint/Milestone changed from 27 to 28
Actions #5

Updated by daviddavis about 8 years ago

  • Assignee set to daviddavis
Actions #6

Updated by daviddavis about 8 years ago

  • Status changed from NEW to ASSIGNED
Actions #7

Updated by bmbouter about 8 years ago

I suspect it's actually all of the Pulp processes that are affected, not just celerybeat. Confirming that and updating the issue would be good.

Quick background on the tasking system's usage of mongodb. There are really three entry points for Pulp: one is httpd itself, the second is celerybeat (which uses scheduler.py), and the third is the workers (pulp_resource_manager and pulp_workers share this one). All of them use a shared section of code to acquire the connection[0]. We manage the connection, and the same code is used by all of our processes, which is why I think this issue affects all of them.

The idea of putting a few try/excepts in key places around the code would help, but it will be tough to get right with that approach for two reasons: (1) the exception could be raised at any point Pulp code calls into mongodb, which is all over the place, and (2) we want to retry-wait-and-continue at the moment the exception is raised, so letting the exception propagate all the way up the stack would prevent resuming at that point in the call stack.

We've had a similar problem before, which caused us to "wrap" all the meaningful pyMongo methods to catch-and-retry. We do that here[1] with one of our features, aptly named unsafe-retry. I think we want to do something similar here: put the try/except there and introduce backoff-and-reconnect behavior for this exception type. Unlike the unsafe-retry feature, which is turned on via a config option in server.conf, I think this backoff and retry should be enabled by default. See how the traceback flows through methods that are decorated by this section of code[1]; I think that is your opportunity. A rough sketch of the wrapping idea follows the links below.

[0]: https://github.com/pulp/pulp/blob/master/server/pulp/server/db/connection.py#L35-L168
[1]: https://github.com/pulp/pulp/blob/master/server/pulp/server/db/connection.py#L203-L260
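
A minimal sketch of that wrapping idea, assuming a hypothetical retry_on_autoreconnect helper; the retry count and backoff values are illustrative, and this is not Pulp's actual connection.py code:

import logging
import time

from pymongo.errors import AutoReconnect

_logger = logging.getLogger(__name__)


def retry_on_autoreconnect(func, max_retries=10, base_delay=0.5):
    """Wrap a callable so that AutoReconnect (and subclasses such as
    ServerSelectionTimeoutError) causes a wait-and-retry at the point of
    failure instead of propagating up the stack."""
    def wrapper(*args, **kwargs):
        attempt = 0
        while True:
            try:
                return func(*args, **kwargs)
            except AutoReconnect:
                attempt += 1
                if attempt > max_retries:
                    raise
                delay = base_delay * (2 ** (attempt - 1))
                _logger.warning('MongoDB unavailable, retrying in %.1fs', delay)
                time.sleep(delay)
    return wrapper

Wrapping the db-touching methods with something like this would let a call such as Worker.objects(name=...).first() block and resume once mongod comes back, instead of letting ServerSelectionTimeoutError kill celerybeat.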

Actions #8

Updated by daviddavis about 8 years ago

Digging into this bug, it looks like in the places where we are failing with ServerSelectionTimeoutError, our retry code is not decorating the methods we need. There are two places we call decorate_instance: once for Documents[0] and once for pymongo Collections[1].

I tried adding 'next', 'first', etc. to our list of methods to decorate, but the object we're dealing with in this bug is a QuerySet. You can see that in the stacktrace, and you can also see that first() calls pymongo's cursor to get the first item, which of course bypasses our retry code.

I debugged to confirm this suspicion and also noticed that there are other places with the same problem (e.g. [2]). I was able to patch our QuerySet class to fix the problem[3], but that doesn't seem like an optimal solution: we'd have to add and maintain yet another class on top of mongoengine/pymongo, and we'd have to go through QuerySet and add every method that accesses the db to our list of decorated methods. A sketch of that patching approach follows the links below.

Not sure if there's some other solution to consider?

[0] https://git.io/vXuNj
[1] https://git.io/vX2FE
[2] https://git.io/vXu7c
[3] https://gist.github.com/daviddavis/58c1d81e3b2ee259667fb26d2eab631c
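
A rough sketch of the QuerySet-patching approach described above (the gist at [3] is not reproduced here; the retry helper and the list of wrapped methods are illustrative assumptions, not Pulp code):

import functools
import time

from mongoengine.queryset import QuerySet
from pymongo.errors import AutoReconnect


def _retry(method, attempts=5, delay=1.0):
    # Wrap a QuerySet method so AutoReconnect causes a wait-and-retry.
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        for remaining in range(attempts, 0, -1):
            try:
                return method(self, *args, **kwargs)
            except AutoReconnect:
                if remaining == 1:
                    raise
                time.sleep(delay)
    return wrapper


class AutoRetryQuerySet(QuerySet):
    """QuerySet whose database-touching methods retry on AutoReconnect."""


# Every QuerySet method that reaches the database has to be listed here and
# kept in sync with mongoengine -- the maintenance burden mentioned above.
for _name in ('first', 'get', 'count'):
    setattr(AutoRetryQuerySet, _name, _retry(getattr(QuerySet, _name)))

Documents could then opt in with meta = {'queryset_class': AutoRetryQuerySet}, but that still means auditing QuerySet for every method that touches the database.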

Actions #9

Updated by bmbouter about 8 years ago

  • Subject changed from "pulp_celerybeat dies if mongod stops" to "The unsafe_autoretry feature does not retry with mongoengine.QuerySet codepaths"

@daviddavis, great job on debugging all of this.

I retitled this bug to reflect what I've learned from your investigation. The unsafe_autoretry feature is supposed to catch-and-retry AutoReconnect exceptions, and the ServerSelectionTimeoutError[0] exception is a subclass of AutoReconnect, so that feature is designed to handle this case already. The problem is that unsafe_autoretry does not correctly catch-and-retry exceptions raised while calling into mongoengine.QuerySet objects. (A quick check of the subclass relationship follows the two options below.)

This won't be an issue in Pulp 3 because we won't be using mongodb at all. For the 2.y line, we have two options:

1. Extend unsafe_autoretry to also cover QuerySet objects.
2. Close this as WONTFIX, since the issue will never go beyond the 2.y line.
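
A quick check of that class relationship, using the exception names straight from pymongo and the error message from the traceback above:

from pymongo.errors import AutoReconnect, ServerSelectionTimeoutError

# ServerSelectionTimeoutError subclasses AutoReconnect, so an except clause
# written for AutoReconnect also catches the error seen in this traceback.
assert issubclass(ServerSelectionTimeoutError, AutoReconnect)

try:
    raise ServerSelectionTimeoutError('localhost:27017: [Errno 111] Connection refused')
except AutoReconnect as exc:
    print('handled:', exc)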

Actions #10

Updated by mhrivnak about 8 years ago

  • Sprint/Milestone changed from 28 to 29
Actions #11

Updated by daviddavis about 8 years ago

  • Related to Issue #2423: As a user, I want pulp_celerybeat to survive a brief database outage without having to use the unsafe_autoretry feature. added
Actions #12

Updated by daviddavis about 8 years ago

Opened up a related issue to look into having pulp_celerybeat not die when we can't connect to MongoDB:

https://pulp.plan.io/issues/2423

Actions #13

Updated by mhrivnak about 8 years ago

The unsafe_autoretry feature is not widely used, and there does not seem to be broad demand for this particular behavior. Let's close this and let pulp 3 make it irrelevant.

Actions #14

Updated by daviddavis about 8 years ago

  • Status changed from ASSIGNED to CLOSED - WONTFIX
Actions #15

Updated by daviddavis about 8 years ago

  • Related to Task #2417: Ensure all processes have initial and reconnect support for PostgreSQL added
Actions #16

Updated by bmbouter almost 7 years ago

  • Sprint set to Sprint 11
Actions #17

Updated by bmbouter almost 7 years ago

  • Sprint/Milestone deleted (29)
Actions #18

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added
