Issue #3171


Resource is not released if qpid connection is interrupted during task

Added by rmcgover over 4 years ago. Updated about 2 years ago.

Start date:
Due date:
Estimated time:
Severity: 3. High
Platform Release:
Sprint Candidate:
Tags: Pulp 2
Sprint: Sprint 30


If a worker's connection to qpid is disrupted while executing a task using a reserved resource, that resource may not be released once the task completes. This can lead to the worker becoming unavailable for scheduling tasks.

I could reproduce this in the vagrant-based development environment by following these steps:

1. (Optional) patch to make publish artificially slow so the timing for reproducing is easier:

diff --git a/server/pulp/server/controllers/ b/server/pulp/server/controllers/
index 1cb9a31..5b22f77 100644
--- a/server/pulp/server/controllers/
+++ b/server/pulp/server/controllers/
@@ -1090,6 +1090,8 @@ def publish(repo_id, dist_id, publish_config_override=None, scheduled_call_id=No

     :raises pulp_exceptions.MissingResource: if distributor/repo pair does not exist
+    import time; time.sleep(10)
     repo_obj = model.Repository.objects.get_repo_or_missing_resource(repo_id)
     dist = model.Distributor.objects.get_or_404(repo_id=repo_id, distributor_id=dist_id)
     dist_inst, dist_conf = _get_distributor_instance_and_config(repo_id, dist_id)

2. Trigger publish of a repo

3. Wait for log to show that publish and release_resource tasks were received by worker, e.g.

Received task: pulp.server.managers.repo.publish.publish[c5875ec5-c1ff-4b91-82c9-683d09f502b7]
Received task: pulp.server.async.tasks._release_resource[59e02fab-331c-4d30-b1d4-7d8447a65629]

4. Restart qpidd: sudo systemctl restart qpidd

5. Wait for log to show that publish has succeeded, e.g.

Task pulp.server.managers.repo.publish.publish[c5875ec5-c1ff-4b91-82c9-683d09f502b7] succeeded in 10.040639891s: ...

6. Observe logs, and "reserved_resources" collection in database.
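
For step 6, a small helper can make leftover reservations easier to spot by grouping them per worker. This is only a sketch, not Pulp code: it assumes each document in reserved_resources carries `worker_name` and `resource_id` fields, the function name is made up for illustration, and the sample documents are hand-written rather than real output.

```python
from collections import defaultdict


def reservations_by_worker(docs):
    """Group reservation documents by worker.

    `docs` is any iterable of dicts, for example the result of a find()
    on the reserved_resources collection. The field names `worker_name`
    and `resource_id` are assumptions about the document layout.
    """
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc["worker_name"]].append(doc["resource_id"])
    return dict(grouped)


# Hand-written sample documents, not real output.
sample = [
    {"_id": "task-1", "worker_name": "worker-a", "resource_id": "repository:zoo"},
    {"_id": "task-2", "worker_name": "worker-b", "resource_id": "repository:other"},
]
print(reservations_by_worker(sample))
```

With pymongo, the input could come from something like `MongoClient()["pulp_database"]["reserved_resources"].find()`, where the database name is again an assumption about the deployment. A worker that still appears in the result long after its task succeeded matches the behavior described below.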

Actual behavior: the log shows that _release_resource is never executed. An entry for the worker remains in the reserved_resources collection indefinitely, preventing further tasks from being scheduled on that worker (except tasks for the same resource).

Expected behavior: entry for the worker in reserved_resources is cleaned up in a timely fashion after the task completes.

Additional info:

I think the issue here is that the celery worker will "reserve" both the publish and release_resource tasks, but when the connection to qpid is broken, it will discard any reserved tasks.
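
To illustrate the hypothesis above, here is a toy model of a worker that prefetches both tasks and drops whatever it has not started when the connection breaks. It is a sketch under that assumption only: `ToyWorker`, `simulate`, and the resource name `repo-1` are hypothetical, and none of this is Celery or Pulp code.

```python
class ToyWorker:
    """Toy stand-in for a prefetching worker; not Celery code."""

    def __init__(self):
        self.prefetched = []  # tasks received but not yet started

    def receive(self, name, func):
        self.prefetched.append((name, func))

    def drop_connection(self):
        # Assumed behavior: tasks that were only prefetched are discarded.
        self.prefetched.clear()

    def run_all(self):
        executed = []
        while self.prefetched:
            name, func = self.prefetched.pop(0)
            func()
            executed.append(name)
        return executed


def simulate(interrupt):
    """Return (executed task names, remaining reservations)."""
    reservations = {"repo-1"}  # reservation made when the task was dispatched
    worker = ToyWorker()
    worker.receive("publish", lambda: None)
    worker.receive("_release_resource", lambda: reservations.discard("repo-1"))

    # The worker starts publish, so it is no longer merely prefetched.
    name, func = worker.prefetched.pop(0)
    if interrupt:
        worker.drop_connection()  # connection breaks while publish is running
    func()  # publish completes either way
    executed = [name] + worker.run_all()
    return executed, reservations


print(simulate(interrupt=False))
print(simulate(interrupt=True))
```

In the uninterrupted run both tasks execute and the reservation set ends up empty. In the interrupted run only publish executes and `repo-1` stays reserved, which is the symptom reported here.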

Actions #1

Updated by dalley over 4 years ago

  • Severity changed from 2. Medium to 3. High
  • Triaged changed from No to Yes
Actions #2

Updated by dalley over 4 years ago

  • Groomed changed from No to Yes
Actions #3

Updated by rchan over 4 years ago

  • Sprint/Milestone set to 52
Actions #4

Updated by bmbouter over 4 years ago

I can see how this is an issue. If the broker still has custody of the _release_resource() message and it loses its connection with the client (via a restart), the message will be purged because it's in a queue that is auto-deleting.

Celery 4 switches "dedicated queues" to be not-auto-deleting, so maybe upgrading that will be the resolution. Pulp already has support for Celery 4, and some future Pulp will only support Celery 4. Even with Celery 4 though, the Pulp user's tasks still cancel unexpectedly because of issue 489. If the queue is not auto-deleting, the _release_resource() won't get lost on restart.
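
The difference between the two queue types can be sketched with a toy broker. This models only the auto-delete rule described above (the queue and its messages go away when its consumer disconnects) and ignores durability and everything else a real broker restart involves. `ToyBroker`, `pending_after_disconnect`, and the queue name are hypothetical; this is not qpid or Celery code.

```python
class ToyBroker:
    """Toy model of queue lifetime; not qpid or Celery code."""

    def __init__(self):
        self.queues = {}  # name -> {"auto_delete": bool, "messages": list}

    def declare(self, name, auto_delete):
        # Declaring an existing queue leaves it and its messages untouched.
        self.queues.setdefault(name, {"auto_delete": auto_delete, "messages": []})

    def publish(self, name, message):
        self.queues[name]["messages"].append(message)

    def consumer_disconnected(self, name):
        # An auto-delete queue is removed, with its messages, once its
        # consumer goes away. Other queues keep their messages.
        if self.queues[name]["auto_delete"]:
            del self.queues[name]

    def pending(self, name):
        queue = self.queues.get(name)
        return list(queue["messages"]) if queue else []


def pending_after_disconnect(auto_delete):
    """Messages still waiting after the worker disconnects and reconnects."""
    broker = ToyBroker()
    queue_name = "worker-a.dq"  # hypothetical dedicated queue name
    broker.declare(queue_name, auto_delete=auto_delete)
    broker.publish(queue_name, "_release_resource")
    broker.consumer_disconnected(queue_name)
    broker.declare(queue_name, auto_delete=auto_delete)  # worker reconnects
    return broker.pending(queue_name)


print(pending_after_disconnect(auto_delete=True))
print(pending_after_disconnect(auto_delete=False))
```

With `auto_delete=True` the reconnecting worker finds an empty queue, so the release message is gone. With `auto_delete=False` the message is still waiting.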

Is this issue still present if tested against Celery 4?

Actions #5

Updated by dalley over 4 years ago

I confirmed that it is not reproducible in a Celery 4 environment. All Pulp versions 2.13+ should fully support Celery 4.

Actions #6

Updated by dalley over 4 years ago

  • Status changed from NEW to CLOSED - CURRENTRELEASE
Actions #7

Updated by bmbouter over 4 years ago

  • Sprint set to Sprint 30
Actions #8

Updated by bmbouter over 4 years ago

  • Sprint/Milestone deleted (52)
Actions #9

Updated by bmbouter about 3 years ago

  • Tags Pulp 2 added
