Resource is not released if qpid connection is interrupted during task
If a worker's connection to qpid is disrupted while executing a task using a reserved resource, that resource may not be released once the task completes. This can lead to the worker becoming unavailable for scheduling tasks.
I could reproduce this in the vagrant-based development environment with the following steps:
1. (Optional) patch to make publish artificially slow so the timing for reproducing is easier:
diff --git a/server/pulp/server/controllers/repository.py b/server/pulp/server/controllers/repository.py
index 1cb9a31..5b22f77 100644
--- a/server/pulp/server/controllers/repository.py
+++ b/server/pulp/server/controllers/repository.py
@@ -1090,6 +1090,8 @@ def publish(repo_id, dist_id, publish_config_override=None, scheduled_call_id=No
     :raises pulp_exceptions.MissingResource: if distributor/repo pair does not exist
     """
+
+    import time; time.sleep(10)
     repo_obj = model.Repository.objects.get_repo_or_missing_resource(repo_id)
     dist = model.Distributor.objects.get_or_404(repo_id=repo_id, distributor_id=dist_id)
     dist_inst, dist_conf = _get_distributor_instance_and_config(repo_id, dist_id)
2. Trigger publish of a repo
3. Wait for log to show that publish and release_resource tasks were received by worker, e.g.
Received task: pulp.server.managers.repo.publish.publish[c5875ec5-c1ff-4b91-82c9-683d09f502b7]
Received task: pulp.server.async.tasks._release_resource[59e02fab-331c-4d30-b1d4-7d8447a65629]
4. Restart qpidd: sudo systemctl restart qpidd
5. Wait for log to show that publish has succeeded, e.g.
Task pulp.server.managers.repo.publish.publish[c5875ec5-c1ff-4b91-82c9-683d09f502b7] succeeded in 10.040639891s: ...
6. Observe the logs and the "reserved_resources" collection in the database.
Actual behavior: log shows that _release_resource is never executed. An entry remains in reserved_resources collection for the used worker indefinitely, preventing scheduling of further tasks for that worker (except for the same resource).
Expected behavior: entry for the worker in reserved_resources is cleaned up in a timely fashion after the task completes.
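To illustrate why a leftover entry blocks the worker, here is a minimal sketch of reservation-aware worker selection. It is a simplified model and not Pulp's actual scheduling code; the function name pick_worker, the field names, and the worker and resource names are all hypothetical.

```python
def pick_worker(resource_id, workers, reservations):
    """Toy model: choose a worker for a task that needs `resource_id`.

    `reservations` stands in for the reserved_resources collection; each
    entry is a dict with hypothetical keys "worker_name" and "resource_id".
    """
    # A worker that already holds this resource keeps receiving its tasks.
    for entry in reservations:
        if entry["resource_id"] == resource_id:
            return entry["worker_name"]
    # Otherwise only a worker with no reservations at all is eligible.
    busy = {entry["worker_name"] for entry in reservations}
    for name in workers:
        if name not in busy:
            return name
    return None  # nobody available; the task has to wait


# Leaked entry: the publish finished, but the release never ran.
leaked = [{"worker_name": "worker-0", "resource_id": "repository:zoo"}]
workers = ["worker-0", "worker-1"]

same_resource = pick_worker("repository:zoo", workers, leaked)
other_resource = pick_worker("repository:other", workers, leaked)
only_leaked_worker = pick_worker("repository:other", ["worker-0"], leaked)
```

In this model worker-0 can still take tasks for the same resource, but it is skipped for anything else, and with a single worker nothing else can be scheduled at all.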
I think the issue here is that the celery worker will "reserve" both the publish and release_resource tasks, but when the connection to qpid is broken, it will discard any reserved tasks.
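A toy simulation of that hypothesis, to make the sequence of events concrete. This is not Celery code; ToyWorker and its methods are invented for illustration, and the assumption that reserved tasks are dropped on connection loss is exactly the unverified guess above.

```python
from collections import deque


class ToyWorker:
    """Hypothetical stand-in for a worker that prefetches tasks."""

    def __init__(self):
        self.reserved = deque()  # received from the broker, not yet started
        self.executed = []       # tasks that ran to completion

    def receive(self, task_name):
        self.reserved.append(task_name)

    def on_connection_lost(self):
        # Assumption under test: reserved (not yet started) tasks are discarded.
        self.reserved.clear()


def simulate(interrupt_during_publish):
    worker = ToyWorker()
    worker.receive("publish")
    worker.receive("_release_resource")

    running = worker.reserved.popleft()  # publish starts executing
    if interrupt_during_publish:
        worker.on_connection_lost()      # qpidd restarts mid-task
    worker.executed.append(running)      # publish still succeeds

    while worker.reserved:               # run whatever is still reserved
        worker.executed.append(worker.reserved.popleft())
    return worker.executed


normal_run = simulate(interrupt_during_publish=False)
interrupted_run = simulate(interrupt_during_publish=True)
```

Under this model the interrupted run completes publish but never runs _release_resource, which matches the observed log output.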
Updated by bmbouter over 4 years ago
I can see how this is an issue. If the broker still has custody of the _release_resource() message and it loses connection with the client (via a restart), the message will be purged because it's in a queue that is auto-deleting.
Celery 4 switches "dedicated queues" to be non-auto-deleting, so upgrading may be the resolution: if the queue is not auto-deleting, the _release_resource() message won't get lost on restart. Pulp already has support for Celery 4, and some future Pulp will only support Celery 4. Even with Celery 4, though, the Pulp user's tasks still cancel unexpectedly because of issue 489.
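A rough sketch of the difference, using a toy in-memory broker rather than qpid or Celery. ToyBroker, its methods, and the queue name are hypothetical; the only behavior modeled is that an auto-deleting queue disappears, along with its pending messages, once its consumer goes away.

```python
class ToyBroker:
    """Hypothetical in-memory broker modeling only the auto-delete flag."""

    def __init__(self):
        self.queues = {}  # queue name -> {"auto_delete": bool, "messages": list}

    def declare(self, name, auto_delete):
        self.queues.setdefault(name, {"auto_delete": auto_delete, "messages": []})

    def publish(self, name, message):
        self.queues[name]["messages"].append(message)

    def consumer_disconnected(self, name):
        # An auto-deleting queue is removed, messages included, when its
        # consumer goes away; any other queue keeps its messages.
        if self.queues[name]["auto_delete"]:
            del self.queues[name]

    def pending(self, name):
        queue = self.queues.get(name)
        return list(queue["messages"]) if queue else []


def pending_after_disconnect(auto_delete):
    broker = ToyBroker()
    broker.declare("worker-0.dq", auto_delete)  # hypothetical queue name
    broker.publish("worker-0.dq", "_release_resource")
    broker.consumer_disconnected("worker-0.dq")
    return broker.pending("worker-0.dq")


lost = pending_after_disconnect(auto_delete=True)
kept = pending_after_disconnect(auto_delete=False)
```

With auto_delete=True the pending _release_resource message is gone after the disconnect; with auto_delete=False it is still there for the worker to pick up when it reconnects.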
Is this issue still present if tested against Celery 4?