Issue #2835
Tasks stuck in waiting after restart of pulp services (closed)
Description
We have observed tasks stuck in the "Waiting" state and never beginning to run multiple times in the past, often after upgrading Pulp (2.13.2 in this case) and sometimes after a reboot. This looks similar to issue #1838.
For example, the task list looks like this when the problem occurs:
$ pulp-admin tasks list
+----------------------------------------------------------------------+
Tasks
+----------------------------------------------------------------------+
Operations: publish
Resources: e2 (repository)
State: Waiting
Start Time: Unstarted
Finish Time: Incomplete
Task Id: 4a28ec8d-c02e-4a64-b00f-e39adc9e7184
Operations: publish
Resources: e3 (repository)
State: Waiting
Start Time: Unstarted
Finish Time: Incomplete
Task Id: 0c7e8a5a-d3da-401f-b6da-220812a51471
Operations: publish
Resources: e4 (repository)
State: Waiting
Start Time: Unstarted
Finish Time: Incomplete
Task Id: c43f74ae-00a4-435c-89b8-74011e8f2fa7
Operations: sync
Resources: e1 (repository)
State: Waiting
Start Time: Unstarted
Finish Time: Incomplete
Task Id: f5e26ee1-158d-4323-bb0e-d2d954ae07d5
Moreover, new tasks will not begin to run. In this situation, restarting the pulp services does not help. The only remedy we came up with is to cancel all waiting tasks and then restart the pulp services.
The situation seems to occur when the pulp services are restarted while tasks are queued. The following script reproduces it for me in a Vagrant pulp dev environment (docker, 2.13 dev branch). The number of workers is set to 1 to ensure that tasks get queued.
set -x
sudo bash -c "echo -e 'PULP_CONCURRENCY=1\nPYTHONIOENCODING="UTF-8"' > /etc/default/pulp_workers"
sudo systemctl restart pulp_celerybeat.service pulp_resource_manager.service pulp_workers.service
sleep 5
for i in e1 e2 e3 e4 ; do pulp-admin rpm repo create --repo-id $i --feed https://dl.fedoraproject.org/pub/epel/7/x86_64/ --serve-https true --relative-url $i ; done
pulp-admin rpm repo sync run --repo-id e1 --bg ; for i in e2 e3 e4 ; do pulp-admin rpm repo publish run --repo-id $i --bg ; done
sleep 10
pulp-admin tasks list
sudo systemctl restart pulp_celerybeat.service pulp_resource_manager.service pulp_workers.service
sleep 2m
pulp-admin rpm repo sync run --repo-id e1 --bg
pulp-admin tasks list
The end result is the task list shown above.
Updated by bmbouter over 7 years ago
I was able to reproduce the issue. Upon further investigation I can see at least three issues manifesting with this one reproducer...
- [Issue A] The resource_manager queue loses a currently running _queue_reserved_task if the resource manager is restarted with sudo systemctl restart pulp_resource_manager. The task is lost from the queue but is left with a stale TaskStatus record showing it as waiting, and it will never run. Note that if you sudo pkill -9 -f resource_manager and then sudo systemctl start pulp_resource_manager, the task is not lost.
- [Issue B] Restarting pulp_workers while a worker is processing a task leaves the worker in a broken-forever state. The reproducer causes 1 task to be running on the worker (with a _release_resource task still in the worker's .dq queue) and the 3 publishes left on the resource_manager queue:
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=============================================================================================================================
reserved_resource_worker-0@dev.dq Y Y 1 2 1 931 1.90k 967 1 2
resource_manager Y 3 42 39 3.71k 49.6k 45.9k 1 2
After running sudo systemctl restart pulp_workers the processes do restart (verified by the pids changing), but the worker that starts does not start processing tasks. The messages are ready for the worker in its .dq queue:
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=============================================================================================================================
reserved_resource_worker-0@dev.dq Y Y 2 2 0 1.95k 1.95k 0 1 2
resource_manager Y 2 42 40 3.71k 49.6k 45.9k 1 2
Note that if you follow the same reproducer, but instead of running sudo systemctl restart pulp_workers you run sudo pkill -9 -f reserved_resource_worker; sudo systemctl stop pulp_workers; sudo systemctl start pulp_workers, the worker starts up in a functional state.
- [Issue C] The systemd unit for pulp_workers will not start the workers if they were stopped outside of systemd. Specifically, with workers running:
sudo systemctl start pulp_workers # Start pulp_workers with systemd (you can verify they run with ps -awfux | grep celery)
sudo pkill -9 -f reserved_resource_worker # Stop the workers outside of systemd
sudo systemctl start pulp_workers # Start the workers again with systemd
ps -awfux | grep celery # Note that no workers are running
This is an issue with how we chain-load the workers.
Updated by ttereshc over 7 years ago
This issue will be split into 3 issues (as described in the previous comment) and then triaged.
Updated by ttereshc over 7 years ago
- Priority changed from Normal to High
- Sprint/Milestone set to 40
- Severity changed from 2. Medium to 3. High
- Triaged changed from No to Yes
This issue is triaged as "Issue B".
@dralley will file a new issue for "Issue A".
Updated by dalley over 7 years ago
Issue A has been filed here: https://pulp.plan.io/issues/2861
Updated by dalley over 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dalley
Updated by ttereshc over 7 years ago
- Related to Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run added
Updated by jortel@redhat.com about 7 years ago
- Sprint/Milestone changed from 43 to 44
Updated by dalley about 7 years ago
- Status changed from ASSIGNED to MODIFIED
Applied in changeset pulp|881a5fb9fdaf9813d0dbb576ab7ca3c7b3dc8476.
Updated by dalley about 7 years ago
- Status changed from MODIFIED to POST
I accidentally pushed this to pulp/2.14-dev instead of my own Github. Apologies.
PR is here: https://github.com/pulp/pulp/pull/3137
Updated by dalley about 7 years ago
- File gdb_process_dumps.zip gdb_process_dumps.zip added
Summary:
Restarting a worker that is currently executing a task will leave that worker in a broken state. This issue can be reproduced on both Celery 3.1.x and Celery 4.x, but only while using Qpid as a broker. I was not able to reproduce this issue while using RabbitMQ as a broker, using either version of Celery. I was also not able to reproduce it on versions of Pulp prior to 2.13. The means of shutting down the workers does not appear to matter, e.g. "systemctl restart" and "pkill -9 celery; prestart" behave the same.
Reproduction Steps:
1. Start pulp
2. Begin a task (e.g. sync)
3. While the task is running, restart the pulp worker running the task
4. After the worker has restarted, begin another task
5. Observe that the tasks are perpetually stuck in waiting
(Exact steps)
1. prestart
2. pulp-admin rpm repo sync run --repo-id zoo
(While the task is still running, in another terminal)
3a. sudo systemctl restart pulp_workers
OR
3b. pkill -9 celery; prestart
4. pulp-admin rpm repo sync run --repo-id zoo
5. Observe that the task is perpetually stuck in waiting
Symptoms:
Only the worker which was running a task at the time of the worker restart is rendered frozen. Future work assigned to this worker will not be executed - this can cause it to seem as though Pulp is entirely frozen, because work is typically assigned to worker zero. Other workers are fine.
From the CLI:
[vagrant@pulp2 ~]$ pulp-admin rpm repo sync run --repo-id zoo
+----------------------------------------------------------------------+
Synchronizing Repository [zoo]
+----------------------------------------------------------------------+
This command may be exited via ctrl+c without affecting the request.
Downloading metadata...
[\]
... completed
Downloading repository content...
[-]
[==================================================] 100%
RPMs: 32/32 items
Delta RPMs: 0/0 items
Task Canceled
[vagrant@pulp2 ~]$ pulp-admin rpm repo sync run --repo-id zoo
+----------------------------------------------------------------------+
Synchronizing Repository [zoo]
+----------------------------------------------------------------------+
This command may be exited via ctrl+c without affecting the request.
[\]
Waiting to begin...
We can see that the Pulp worker is not consuming work from its associated dedicated queue - the messages sent to the queue by dispatching a new task are not being read out of the queue by the worker.
State of queues before running any tasks
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
===================================================================================================================================
reserved_resource_worker-0@pulp2.dev.dq Y Y 0 0 0 0 0 0 1 2
resource_manager Y 0 36 36 0 43.1k 43.1k 1 2
State of queues after restarting worker mid-task
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
===================================================================================================================================
reserved_resource_worker-0@pulp2.dev.dq Y Y 0 0 0 0 0 0 1 2
resource_manager Y 0 40 40 0 47.8k 47.8k 1 2
State of queues after dispatching a new task to the frozen worker
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
===================================================================================================================================
reserved_resource_worker-0@pulp2.dev.dq Y Y 2 2 0 1.93k 1.93k 0 1 2
resource_manager Y 0 41 41 0 49.0k 49.0k 1 2
If you restart the worker again immediately after the mid-task restart, without assigning any new work to the queue, it will restart cleanly and continue accepting work normally.
How to "fix":¶
List all of the tasks in waiting with 'pulp-admin tasks list', cancel them all with 'pulp-admin tasks list cancel', and then restart the worker
- If you cancel the new task without restarting the worker and then issue a new task, that task will become hung just the same
- If you restart the worker without cancelling a current hung task, the task will be marked cancelled on reboot of the worker, but future work dispatched to the worker will remain hung
- If you do not assign new work to the worker, then restart the worker, it will continue accepting work normally
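For convenience, here is a minimal Python sketch of that workaround. It assumes pulp-admin is configured for the local server and that the 'pulp-admin tasks cancel --task-id <id>' command is available; treat it as an illustration of the manual steps above rather than a supported tool.

import subprocess


def waiting_task_ids():
    # Collect the ids of tasks that 'pulp-admin tasks list' reports as Waiting.
    output = subprocess.check_output(['pulp-admin', 'tasks', 'list']).decode()
    state = None
    task_ids = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith('State:'):
            state = line.split(':', 1)[1].strip()
        elif line.startswith('Task Id:') and state == 'Waiting':
            task_ids.append(line.split(':', 1)[1].strip())
    return task_ids


def cancel_waiting_and_restart():
    for task_id in waiting_task_ids():
        subprocess.check_call(['pulp-admin', 'tasks', 'cancel', '--task-id', task_id])
    # Restart the workers only after every waiting task has been cancelled.
    subprocess.check_call(['sudo', 'systemctl', 'restart', 'pulp_workers'])


if __name__ == '__main__':
    cancel_waiting_and_restart()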
Origin of this issue:
A change introduced in Pulp 2.13 made use of Celery's bootsteps feature as a means for workers to write their own heartbeats to the database. This appears to have created the conditions for this bug to emerge.
That PR is here: https://github.com/pulp/pulp/pull/2922
After this change, when the function we registered to be executed on the celeryd_after_setup signal runs, the call chain _delete_worker() -> cancel() -> controller.revoke() appears to cause the worker to become frozen, whereas previously it had worked fine.
https://github.com/pulp/pulp/blob/master/server/pulp/server/async/tasks.py#L663
https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L158
The exact reason that using bootsteps causes the call to revoke() to put the worker in a bad state is still undetermined.
Mitigation:
My current pull request [0] against Pulp mitigates this issue by registering a new terminate() function on the Celery consumer (called as part of the worker's shutdown process) which calls _delete_worker() itself. This only fixes the case where the workers are shut down "kindly", but that is going to be the most common case, and the fix works reliably for it.
[0] https://github.com/pulp/pulp/pull/3137/files
Pulp already has code which was supposed to accomplish this using the worker_shutdown signal [1], but upon testing I found that while the signal was fired correctly, the qpid.messaging thread had already been stopped at that point, and the resulting traceback prevented the code from reaching the portion of _delete_worker() where the Pulp task is marked cancelled. Therefore, when the worker restarts, it attempts to clean up those stale workers, calling revoke() and freezing up.
[1] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L171
This approach fixes the issue in the case where the workers are shut down "nicely", i.e. with SIGTERM. An expanded fix also solves the "pkill" case by moving the Pulp worker cleanup and initialization code from the celeryd_after_setup signal handler to the start() method on the Consumer bootstep. However, I have no idea WHY this works, or why having that code in celeryd_after_setup broke when we started using bootsteps in the first place. So I'm not comfortable pushing that expanded fix until those questions are answered.
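To make the shape of the mitigation and the expanded fix concrete, here is a minimal, hypothetical sketch in the style of the test workers attached below. It is not the actual Pulp code: mark_worker_offline() is an invented stand-in for _delete_worker(), and the real PR attaches a terminate() handler to the consumer rather than relying only on stop().

from celery import Celery
from celery import bootsteps


def mark_worker_offline(consumer):
    # Invented stand-in for Pulp's _delete_worker(): remove the worker record
    # and deal with its unfinished TaskStatus entries.
    print('{0!r} cleaning up before shutdown'.format(consumer))


class CleanupStep(bootsteps.StartStopStep):
    # Attached to the consumer blueprint, so start()/stop() run while the
    # broker connection is still usable.

    def start(self, consumer):
        # Expanded-fix variant: stale-worker cleanup and initialization would
        # happen here instead of in the celeryd_after_setup signal handler.
        print('{0!r} consumer starting'.format(consumer))

    def stop(self, consumer):
        # Orderly (SIGTERM) shutdown path: do the cleanup ourselves instead of
        # relying on the worker_shutdown signal, which fires after the
        # qpid.messaging thread is already gone.
        mark_worker_offline(consumer)


app = Celery(broker='amqp://')
app.steps['consumer'].add(CleanupStep)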
Debugging:
Attached is the output of running "thread apply all py-bt" in GDB against coredumps of the parent and child Celery processes in 4 states.
nowork_* => Clean processes before any work whatsoever has been dispatched
postsync_* => Processes after work has been dispatched normally, without restarting the workers mid-execution
restarted_* => Processes after the pulp workers have been restarted mid-execution of a task
hung_* => Processes after being restarted and having new work dispatched, which are now hung
All 4 parent dumps are identical. The two "clean" child dumps, nowork_child.txt and postsync_child.txt, are also identical to each other, as are the two "dirty" child dumps, hung_child.txt and restarted_child.txt. So, for the sake of comparing differences between the dumps, we only need to look at two - which may as well be nowork_child.txt and hung_child.txt
Added by dalley about 7 years ago
Revision 066d4ea8 | View on GitHub
Mitigate issue causing workers to be left broken
Fixes an issue where workers can be left in a broken state if they are restarted while processing a task.
Only fixes for the SIGTERM shutdown case.
re #2835 https://pulp.plan.io/issues/2835
(cherry picked from commit b565bd357bc698f11b68d1975d7c258c303f9b31)
Updated by dalley about 7 years ago
The mitigation patch was applied in PR https://github.com/pulp/pulp/pull/3137/, so now we only need to worry about the case where workers get killed unceremoniously (e.g. OOM, pkill, etc.)
Updated by dalley about 7 years ago
- File celery_startup_log_bootsteps_and_signal.txt celery_startup_log_bootsteps_and_signal.txt added
- File celery_startup_log_signal_only.txt celery_startup_log_signal_only.txt added
The aforementioned "full workaround", which I found by tinkering but do not understand, is here:
https://github.com/pulp/pulp/compare/master...dralley:2835-hack-fix
I consider this to be hacky... it doesn't address why the introduction of bootsteps broke previously-functional code within the celeryd_after_setup signal handler.
I've attached two additional logs, where I started some minimal celery workers which are structurally identical to the pulp workers, and watched their startup sequences.
test_worker1 (similar to the pre-2.13 pulp worker code)
from celery import Celery
from celery.signals import celeryd_after_setup


@celeryd_after_setup.connect
def test_signal(sender, instance, **kwargs):
    print('{0!r} celeryd_after_setup signal fired'.format(sender))


app = Celery(broker='amqp://')
test_worker2 (similar to the 2.13+ pulp_worker code)
from celery import Celery
from celery import bootsteps
from celery.signals import celeryd_after_setup


class Reproducer(bootsteps.StartStopStep):

    def __init__(self, parent, **kwargs):
        print('{0!r} is in init'.format(parent))

    def start(self, worker):
        print('{0!r} is starting up'.format(worker))
        self.timer_ref = worker.timer.call_repeatedly(
            5,
            self.do_work,
            (worker, ),
            priority=10,
        )

    def do_work(self, worker):
        print('{0!r} heartbeat'.format(worker))

    def stop(self, parent):
        print('{0!r} is stopping'.format(parent))

    def shutdown(self, parent):
        print('{0!r} is shutting down'.format(parent))


@celeryd_after_setup.connect
def test_signal(sender, instance, **kwargs):
    print('{0!r} celeryd_after_setup signal fired'.format(sender))


app = Celery(broker='amqp://')
app.steps['consumer'].add(Reproducer)
There are differences in the logs, but I don't see those differences as being meaningful, since they all occur after the point that we care about.
Updated by jortel@redhat.com about 7 years ago
- Sprint/Milestone changed from 45 to 46
Updated by dalley about 7 years ago
I made one other discovery:
I wanted to see what would happen if I attached the heartbeat bootstep to the worker as it had been originally, before I made the fix in this PR: https://github.com/pulp/pulp/pull/2984
It works! However, that fix was made for a reason - we needed the heartbeats to stop when the broker connection was lost, and restart when it was regained. The celery "worker" (overlapping terms again... this is the "worker" component, not the broad concept of a worker) has no knowledge of the state of the broker, so the functionality needed to be attached to the celery consumer component.
https://github.com/pulp/pulp/pull/2984/files#diff-ac9a188d0b9425fa260a49c7def6aa0fL124
So, reverting the change from that PR fixes #2835 but breaks reconnect support. Hopefully this helps us narrow down the cause further.
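For illustration, the difference between the two attachment points looks roughly like this. This is a sketch modelled on the Reproducer above, not the actual Pulp heartbeat code.

from celery import Celery
from celery import bootsteps


class Heartbeat(bootsteps.StartStopStep):

    def start(self, parent):
        # For a worker step, 'parent' is the WorkController; for a consumer
        # step it is the Consumer, which is stopped and restarted whenever the
        # broker connection is lost and regained.
        self.timer_ref = parent.timer.call_repeatedly(5, self.beat, (parent, ))

    def beat(self, parent):
        print('{0!r} heartbeat'.format(parent))

    def stop(self, parent):
        if getattr(self, 'timer_ref', None):
            self.timer_ref.cancel()


app = Celery(broker='amqp://')

# Pre-#2984 placement: the step lives on the worker blueprint and has no
# knowledge of the broker connection state.
app.steps['worker'].add(Heartbeat)

# Post-#2984 placement: the step lives on the consumer blueprint, so the
# heartbeat stops and restarts along with the broker connection.
# app.steps['consumer'].add(Heartbeat)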
To reproduce this, check out commit 3a3b5f020eca1d019f51301ffe5d9bc2dbffcdb2 on the pulp repo (one commit prior to the PR in question)
Added by dalley almost 7 years ago
Revision aa7a1de2 | View on GitHub
Fixes workers crashing on restart edge case
Fixes an issue where workers would crash when attempting to revoke() a previously running task on startup. If the user sends SIGKILL to the worker while running a task, the worker will attempt to mark the task as stale when restarted. This causes a hang for yet-unknown reasons. Calling revoke() is not necessary when a dead worker is being cleaned up because all work goes through dedicated queues which disappear when the worker does. We leave revoke() active only when a task is being cancelled outside of worker cleanup.
closes #2835 https://pulp.plan.io/issues/2835
modified: server/pulp/server/async/tasks.py
modified: server/test/unit/server/async/test_tasks.py
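Schematically, the behaviour change described in this commit amounts to something like the following. All names are hypothetical stand-ins, not the actual code in server/pulp/server/async/tasks.py.

def revoke_on_broker(task_id):
    # Stand-in for controller.revoke(); calling this during startup cleanup is
    # what left the restarted worker hung.
    print('revoking %s on the broker' % task_id)


def mark_canceled_in_db(task_id):
    # Stand-in for updating the task's TaskStatus record.
    print('marking %s canceled in the database' % task_id)


def cancel_task(task_id, revoke=True):
    # User-initiated cancellation outside of worker cleanup still revokes on
    # the broker before marking the task canceled.
    if revoke:
        revoke_on_broker(task_id)
    mark_canceled_in_db(task_id)


def cleanup_dead_worker(unfinished_task_ids):
    # Cleaning up after a dead worker skips revoke(): the worker's dedicated
    # queue disappeared along with the worker, so there is nothing left on the
    # broker to revoke.
    for task_id in unfinished_task_ids:
        cancel_task(task_id, revoke=False)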
Updated by dalley almost 7 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|aa7a1de219b02bb0aa5a3674a72d77a653ee968f.
Added by dalley almost 7 years ago
Revision 76b756f3 | View on GitHub
Fixes workers crashing on restart edge case
Fixes an issue where workers would crash when attempting to revoke() a previously running task on startup. If the user sends SIGKILL to the worker while running a task, the worker will attempt to mark the task as stale when restarted. This causes a hang for yet-unknown reasons. Calling revoke() is not necessary when a dead worker is being cleaned up because all work goes through dedicated queues which disappear when the worker does. We leave revoke() active only when a task is being cancelled outside of worker cleanup.
closes #2835 https://pulp.plan.io/issues/2835
modified: server/pulp/server/async/tasks.py
modified: server/test/unit/server/async/test_tasks.py
(cherry picked from commit aa7a1de219b02bb0aa5a3674a72d77a653ee968f)
Updated by dalley almost 7 years ago
Applied in changeset pulp|76b756f3965bc5603f99c9427201a1a00b9fa585.
Updated by bmbouter over 6 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE