Issue #4743

Celery workers may deadlock after cancel if "celery inspect" has been used

Added by rmcgover almost 5 years ago. Updated almost 4 years ago.

Status: CLOSED - WONTFIX
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: No
Groomed: No
Sprint Candidate: No
Tags:
Sprint:
Quarter:

Description

With Pulp 2.x on RHEL7, celery reserved_resource workers can deadlock if "celery inspect" has been used on the system and a task has been cancelled & terminated on the worker.

Steps to reproduce

1. In a Pulp environment, arrange for a task to run slowly enough that it can be cancelled while still in progress.

An easy way to do this is to patch delete_all_orphans to introduce a sleep, e.g.:

--- a/server/pulp/server/managers/content/orphan.py
+++ b/server/pulp/server/managers/content/orphan.py
@@ -163,16 +163,18 @@ class OrphanManager(object):
     @staticmethod
     def delete_all_orphans():
         """
         Delete all orphaned content units.

         :return: count of units deleted indexed by content_type_id
         :rtype: dict
         """
+        import time
+        _logger.warn("About to sleep"); time.sleep(60); _logger.warn("Slept")
         ret = {}
         for content_type_id in content_types_db.all_type_ids():
             count = OrphanManager.delete_orphans_by_type(content_type_id)
             if count > 0:
                 ret[content_type_id] = count

         for content_type_id in plugin_api.list_unit_models():
             count = OrphanManager.delete_orphan_content_units_by_type(content_type_id)

Note: tasks which install a SIGTERM handler, such as repo publish, should be avoided as the signal handler may prevent triggering the issue.
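
For reference, the kind of handler meant here looks roughly like the sketch below. This is illustrative only (the function names are made up, not Pulp's actual publish code); the point is that a task-installed SIGTERM handler can intercept the terminate signal sent on cancellation, so the issue may not trigger.

# Illustrative sketch only -- not Pulp's actual publish code. A task body
# that installs its own SIGTERM handler can intercept the terminate signal
# sent on cancellation, so the child process keeps running.
import signal
import time

def _on_sigterm(signum, frame):
    # Clean up and carry on instead of dying, so the terminate never lands.
    print("SIGTERM received, performing cleanup")

def long_running_task():
    signal.signal(signal.SIGTERM, _on_sigterm)
    time.sleep(60)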

2. Start one reserved_resource worker

3. Check worker state with "celery inspect":

# celery -A pulp.server.async.app inspect reserved
-> resource_manager@bfb9e654fdc2: OK
    - empty -
-> reserved_resource_worker-1@bfb9e654fdc2: OK
    - empty -
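
The same query can also be made programmatically through Celery's control API, which is handy when scripting the check. A minimal sketch; the broker URL is only a placeholder for this deployment's actual qpid URL.

# Minimal sketch: programmatic equivalent of "celery inspect reserved".
# The broker URL is a placeholder for the deployment's actual qpid URL.
from celery import Celery

app = Celery(broker="qpid://localhost:5672/")
reserved = app.control.inspect().reserved() or {}
for worker, tasks in reserved.items():
    print(worker, "has", len(tasks), "reserved task(s)")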

4. Start a task in the background:

# pulp-admin -u admin -p admin orphan remove --all --bg

5. Cancel the running task:

# pulp-admin -u admin -p admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  
Resources:   orphans (content_unit)
State:       Running
Start Time:  2019-04-29T05:29:42Z
Finish Time: Incomplete
Task Id:     c089817a-7d80-4782-a960-a603179b18df
Worker Name: reserved_resource_worker-1@bfb9e654fdc2

# pulp-admin -u admin -p admin tasks cancel --task-id c089817a-7d80-4782-a960-a603179b18df
Task cancel is successfully initiated.
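
Cancelling a running task like this ultimately relies on Celery's revoke-with-terminate mechanism, roughly as in the following sketch (not Pulp's exact code; the broker URL is a placeholder and the task id is the one from the listing above).

# Rough sketch of the underlying mechanism: revoking a Celery task with
# termination. Not Pulp's exact cancellation code; the broker URL is a
# placeholder and the task id is the one listed above.
from celery import Celery

app = Celery(broker="qpid://localhost:5672/")
app.control.revoke("c089817a-7d80-4782-a960-a603179b18df", terminate=True)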

6. Try to start another task:

# pulp-admin -u admin -p admin orphan remove --all --bg

Expected behavior

- Task cancelled at step (5) is terminated.
- Task queued at step (6) is executed successfully.

Actual behavior

- Task cancelled at step (5) is terminated.
- Worker starts producing log messages like this:

pulp: celery.concurrency.asynpool:ERROR: Timed out waiting for UP message from <ForkProcess(ForkPoolWorker-2, started daemon)>

- Neither the task queued at step (6) nor any other task assigned to that worker is executed.

Additional info

Reproduced with these packages:

# rpm -qa | egrep 'pulp|kombu' | sort
pulp-admin-client-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
pulp-rpm-admin-extensions-2.20.0-0.1.alpha.201904230511git2eff86d.el7.noarch
pulp-rpm-plugins-2.20.0-0.1.alpha.201904230511git2eff86d.el7.noarch
pulp-selinux-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
pulp-server-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-isodate-0.5.0-4.pulp.el7.noarch
python-pulp-bindings-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-client-lib-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-common-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-oid_validation-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-repoauth-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-rpm-common-2.20.0-0.1.alpha.201904230511git2eff86d.el7.noarch
python2-kombu-4.0.2-12.el7.noarch

That particular version (or newer) of python2-kombu must be used; otherwise, another issue (https://pulp.plan.io/issues/4581) blocks reproducing this one, since cancel won't be able to terminate tasks at all.

"celery inspect" triggers the issue because it causes the celery worker to create a new qpid connection & associated thread. When celery creates a new child process after terminating some task, the new child tries to close the qpid connection, but since fork() did not copy the related thread for that connection, the close attempt hangs. It's a known issue which has been partially resolved in the past, e.g. see https://issues.apache.org/jira/browse/QPID-5637, but is still reproducible here.

While "celery inspect" isn't required for general usage of Pulp, it's one of the standard tools for monitoring a celery cluster, and being unable to use it makes Pulp more difficult to manage.

Actions #1

Updated by ipanova@redhat.com almost 5 years ago

  • Status changed from NEW to CLOSED - WONTFIX

After a conversation with the team, we advise switching to RabbitMQ; the mentioned issue is fully handled by that broker. Are there any obstacles to doing so?
We also do not recommend using `celery inspect` because it is not reliable for monitoring purposes; there can be discrepancies between the heartbeat stored in the DB and what the `celery inspect` output shows, so the `status api` is recommended instead (see the sketch at the end of this comment).

Thirdly, in case it is really critical to use qpid and a switch to RabbitMQ is not possible, the fixes mentioned by Rohan in the Jira ticket can be sent to kombu/qpid upstream.
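
For reference, a status API check could be as simple as the following sketch, assuming the standard Pulp 2 status endpoint at /pulp/api/v2/status/. Host, TLS verification and any auth requirements are deployment-specific, and the exact response fields (known workers, broker and database connectivity) should be checked against the docs.

# Sketch of polling Pulp 2's status API instead of "celery inspect".
# Host, TLS verification and auth requirements are deployment-specific;
# consult the Pulp docs for the exact response fields.
import json
import requests

resp = requests.get("https://localhost/pulp/api/v2/status/", verify=False)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))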

Actions #2

Updated by rmcgover almost 5 years ago

There are no known reasons why we couldn't use RabbitMQ. It'll need a lot of testing on our side, of course. The Pulp docs say that Pulp is developed and tested with qpid, which seems to imply that qpid is the preferred option.

> We also do not recommend using `celery inspect` because it is not reliable for monitoring purposes; there can be discrepancies between the heartbeat stored in the DB and what the `celery inspect` output shows

Those discrepancies are exactly the reason why it's useful: since Pulp manages task status in a separate layer from celery, the information in Pulp's DB/API is sometimes wrong or incomplete. For example, mongo can have an outage after a celery task completes but before Pulp manages to save the task report, leaving the task marked as in progress. "celery inspect" can surface such a discrepancy (e.g. from a nagios check or similar monitoring tool), as in the sketch below.
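
A check of that kind could look roughly like this: compare the task ids Pulp's REST API reports as running against what the workers themselves report via Celery's inspect API. This is a sketch, not an actual nagios plugin; the broker URL, host, credentials and the Pulp task field names are assumptions to verify against the deployment.

# Rough sketch of such a consistency check, not an actual nagios plugin.
# Broker URL, host, credentials and the exact Pulp task fields are
# assumptions to verify against the deployment.
import requests
from celery import Celery

app = Celery(broker="qpid://localhost:5672/")
active = app.control.inspect().active() or {}
celery_ids = {t["id"] for tasks in active.values() for t in tasks}

resp = requests.get(
    "https://localhost/pulp/api/v2/tasks/",
    auth=("admin", "admin"),
    verify=False,
)
resp.raise_for_status()
pulp_ids = {t["task_id"] for t in resp.json() if t.get("state") == "running"}

stale = pulp_ids - celery_ids
if stale:
    print("CRITICAL: running according to Pulp but unknown to any worker:", sorted(stale))
else:
    print("OK")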

Actions #3

Updated by bmbouter almost 5 years ago

rmcgover I believe it's fully safe to switch to RabbitMQ for your production environment, and it will resolve these issues for you. Celery is designed and tested to work with RabbitMQ, and several people in the Pulp community run Pulp on RabbitMQ in their production environments. The RabbitMQ driver doesn't have the incompatibilities with Celery that the Qpid driver does, so you're avoiding problems in that area. Also, all the Celery tooling should work great.

With all of this praise for RabbitMQ you may wonder why Pulp doesn't use it by default. Perhaps Pulp should have. Regardless, we aren't able to make that switch for the whole project at this point in Pulp2's history. Your environment could switch, though.
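
For anyone attempting the switch: as far as I recall, it comes down to pointing the broker settings in /etc/pulp/server.conf at RabbitMQ, roughly as in the sketch below. The exact section and option names should be double-checked against the Pulp 2 docs for the installed version.

# Sketch of the broker settings in /etc/pulp/server.conf for RabbitMQ;
# verify section and option names against the Pulp 2 docs for your version.
[messaging]
url: tcp://localhost:5672
transport: rabbitmq

[tasks]
broker_url: amqp://guest:guest@localhost//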
