Issue #4743

closed

Celery workers may deadlock after cancel if "celery inspect" has been used

Added by rmcgover about 5 years ago. Updated almost 4 years ago.

Status: CLOSED - WONTFIX
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: No
Groomed: No
Sprint Candidate: No
Tags:
Sprint:
Quarter:

Description

With Pulp 2.x on RHEL7, celery reserved_resource workers can deadlock if "celery inspect" has been used on the system and a task has been cancelled & terminated on the worker.

Steps to reproduce

1. On a Pulp environment, arrange for a task to be slow enough that it can be cancelled while still in progress.

An easy way to do this is to patch delete_all_orphans to introduce some sleep, e.g.

--- a/server/pulp/server/managers/content/orphan.py
+++ b/server/pulp/server/managers/content/orphan.py
@@ -163,16 +163,18 @@ class OrphanManager(object):
     @staticmethod
     def delete_all_orphans():
         """
         Delete all orphaned content units.

         :return: count of units deleted indexed by content_type_id
         :rtype: dict
         """
+        import time
+        _logger.warn("About to sleep"); time.sleep(60); _logger.warn("Slept")
         ret = {}
         for content_type_id in content_types_db.all_type_ids():
             count = OrphanManager.delete_orphans_by_type(content_type_id)
             if count > 0:
                 ret[content_type_id] = count

         for content_type_id in plugin_api.list_unit_models():
             count = OrphanManager.delete_orphan_content_units_by_type(content_type_id)

Note: tasks which install a SIGTERM handler, such as repo publish, should be avoided as the signal handler may prevent triggering the issue.
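
For reference, installing a SIGTERM handler in a task generally looks like the hypothetical sketch below (illustration only, not the actual repo publish code). A task that handles SIGTERM itself can shut down on its own terms rather than being killed by the default signal action, which is presumably why such tasks don't reliably reproduce the hang.

# Hypothetical illustration only, not Pulp's repo publish implementation.
# A task that registers its own SIGTERM handler reacts to a terminate
# request itself instead of being killed by the default signal action.
import signal
import sys

def _on_sigterm(signum, frame):
    # Task-specific cleanup would go here, followed by a clean exit.
    sys.exit(0)

signal.signal(signal.SIGTERM, _on_sigterm)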

2. Start one reserved_resource worker

3. Check the worker state with "celery inspect":

# celery -A pulp.server.async.app inspect reserved
-> resource_manager@bfb9e654fdc2: OK
    - empty -
-> reserved_resource_worker-1@bfb9e654fdc2: OK
    - empty -

4. Start a task in the background:

# pulp-admin -u admin -p admin orphan remove --all --bg

5. Cancel the running task:

# pulp-admin -u admin -p admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  
Resources:   orphans (content_unit)
State:       Running
Start Time:  2019-04-29T05:29:42Z
Finish Time: Incomplete
Task Id:     c089817a-7d80-4782-a960-a603179b18df
Worker Name: reserved_resource_worker-1@bfb9e654fdc2

# pulp-admin -u admin -p admin tasks cancel --task-id c089817a-7d80-4782-a960-a603179b18df
Task cancel is successfully initiated.

6. Try to start another task:

# pulp-admin -u admin -p admin orphan remove --all --bg

Expected behavior

- Task cancelled at step (5) is terminated.
- Task queued at step (6) is executed successfully.

Actual behavior

- Task cancelled at step (5) is terminated.
- Worker starts producing log messages like this:

pulp: celery.concurrency.asynpool:ERROR: Timed out waiting for UP message from <ForkProcess(ForkPoolWorker-2, started daemon)>

- Task queued at step (6) is not executed, nor are any other tasks assigned to that worker.

Additional info

Reproduced with these packages:

# rpm -qa | egrep 'pulp|kombu' | sort
pulp-admin-client-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
pulp-rpm-admin-extensions-2.20.0-0.1.alpha.201904230511git2eff86d.el7.noarch
pulp-rpm-plugins-2.20.0-0.1.alpha.201904230511git2eff86d.el7.noarch
pulp-selinux-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
pulp-server-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-isodate-0.5.0-4.pulp.el7.noarch
python-pulp-bindings-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-client-lib-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-common-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-oid_validation-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-repoauth-2.20.0-0.1.alpha.201904230515gitc86c433.el7.noarch
python-pulp-rpm-common-2.20.0-0.1.alpha.201904230511git2eff86d.el7.noarch
python2-kombu-4.0.2-12.el7.noarch

That particular version (or newer) of python2-kombu must be used; otherwise another issue, https://pulp.plan.io/issues/4581, will block reproducing this one, since cancel won't be able to terminate tasks at all.

"celery inspect" triggers the issue because it causes the celery worker to create a new qpid connection & associated thread. When celery creates a new child process after terminating some task, the new child tries to close the qpid connection, but since fork() did not copy the related thread for that connection, the close attempt hangs. It's a known issue which has been partially resolved in the past, e.g. see https://issues.apache.org/jira/browse/QPID-5637, but is still reproducible here.

While "celery inspect" is not required for general usage of Pulp, it is one of the standard tools for monitoring a celery cluster, and the inability to use it makes Pulp more difficult to manage.
