Project

Profile

Help

Issue #2835

Tasks stuck in waiting after restart of pulp services

Added by gmbnomis over 3 years ago. Updated over 1 year ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
3. High
Version:
2.13.1
Platform Release:
2.15.2
OS:
RHEL 7
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 31
Quarter:

Description

We observed tasks being stuck in "Waiting" state and not beginning to run multiple times in the past (often after upgrading Pulp (2.13.2 in this case). Sometimes also after a reboot.) This looks similar to issue #1838

For example, the task list looks like this when the problem occurs:

$ pulp-admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+

Operations:  publish
Resources:   e2 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     4a28ec8d-c02e-4a64-b00f-e39adc9e7184

Operations:  publish
Resources:   e3 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     0c7e8a5a-d3da-401f-b6da-220812a51471

Operations:  publish
Resources:   e4 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     c43f74ae-00a4-435c-89b8-74011e8f2fa7

Operations:  sync
Resources:   e1 (repository)
State:       Waiting
Start Time:  Unstarted
Finish Time: Incomplete
Task Id:     f5e26ee1-158d-4323-bb0e-d2d954ae07d5

Moreover, new tasks will not begin to run. In this situation, restarting the pulp services does not help. The only remedy we came up with is to cancel all waiting tasks and then restart the pulp services.

The situation seems to occur when pulp services are restarted with tasks being queued. The following script generates this situation for me in a vagrant pulp dev environment (docker, 2.13 dev branch). The number of workers is set to 1 in order to ensure that tasks get queued.

set -x

sudo bash -c "echo -e 'PULP_CONCURRENCY=1\nPYTHONIOENCODING="UTF-8"' > /etc/default/pulp_workers"

sudo systemctl restart pulp_celerybeat.service pulp_resource_manager.service pulp_workers.service
sleep 5

for i in e1 e2 e3 e4 ; do pulp-admin rpm repo create --repo-id $i --feed https://dl.fedoraproject.org/pub/epel/7/x86_64/ --serve-https true --relative-url $i ; done

pulp-admin rpm repo sync run --repo-id e1 --bg ; for i in e2 e3 e4 ; do pulp-admin rpm repo publish run --repo-id $i --bg ; done
sleep 10
pulp-admin tasks  list
sudo systemctl restart pulp_celerybeat.service pulp_resource_manager.service pulp_workers.service
sleep 2m
pulp-admin rpm repo sync run --repo-id e1 --bg
pulp-admin tasks  list

The end result is the task list shown above.


Related issues

Related to Pulp - Issue #2734: task group never quite finishes if pulp is restarted in the middle of task runCLOSED - CURRENTRELEASE<a title="Actions" class="icon-only icon-actions js-contextmenu" href="#">Actions</a>

Associated revisions

Revision b565bd35 View on GitHub
Added by dalley about 3 years ago

Mitigate issue causing workers to be left broken

Fixes an issue where workers can be left in a broken state if they are restarted while processing a task.

Only fixes for the SIGTERM shutdown case.

re #2835 https://pulp.plan.io/issues/2835

Revision 066d4ea8 View on GitHub
Added by dalley about 3 years ago

Mitigate issue causing workers to be left broken

Fixes an issue where workers can be left in a broken state if they are restarted while processing a task.

Only fixes for the SIGTERM shutdown case.

re #2835 https://pulp.plan.io/issues/2835

(cherry picked from commit b565bd357bc698f11b68d1975d7c258c303f9b31)

Revision aa7a1de2 View on GitHub
Added by dalley almost 3 years ago

Fixes workers crashing on restart edge case

Fixes an issue where workers would crash when attempting to revoke() a previously running task on startup. If the user sends SIGKILL to the worker while running a task, the worker will attempt to mark the task as stale when restarted. This causes a hang for yet-unknown reasons. Calling revoke() is not necessary when a dead worker is being cleaned up because all work goes through dedicated queues which disappear when the worker does. We leave revoke() active only when a task is being cancelled outside of worker cleanup.

closes #2835 https://pulp.plan.io/issues/2835

modified:   server/pulp/server/async/tasks.py
modified:   server/test/unit/server/async/test_tasks.py

Revision 76b756f3 View on GitHub
Added by dalley almost 3 years ago

Fixes workers crashing on restart edge case

Fixes an issue where workers would crash when attempting to revoke() a previously running task on startup. If the user sends SIGKILL to the worker while running a task, the worker will attempt to mark the task as stale when restarted. This causes a hang for yet-unknown reasons. Calling revoke() is not necessary when a dead worker is being cleaned up because all work goes through dedicated queues which disappear when the worker does. We leave revoke() active only when a task is being cancelled outside of worker cleanup.

closes #2835 https://pulp.plan.io/issues/2835

modified:   server/pulp/server/async/tasks.py
modified:   server/test/unit/server/async/test_tasks.py

(cherry picked from commit aa7a1de219b02bb0aa5a3674a72d77a653ee968f)

History

#1 Updated by bmbouter over 3 years ago

I was able to reproduce the issue. Upon further investigation I can see at least three issues manifesting with this one reproducer...

  • [Issue A] The resource_manager queue looses a currently running _queue_reserved_task if the resource manager is restarted with sudo systemctl restart pulp_resource_manager. The task is lost from the queue but still has an incorrect TaskStatus record showing as waiting which will never run. Note that if you sudo pkill -9 -f resource_manager and the sudo systemctl start pulp_resource_manager it does not loose the task.
  • [Issue B] Restarting pulp_workers while a worker is processing a task leaves the worker in a broker-forever state. The reproducer causes 1 task to be running on the worker (with a _release_resource) task still in the worker's .dq queue and the 3 publishes left on the resource manager:
  queue                                         dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =============================================================================================================================
  reserved_resource_worker-0@dev.dq             Y    Y                 1     2      1     931   1.90k     967         1     2
  resource_manager                              Y                      3    42     39    3.71k  49.6k    45.9k        1     2

After running sudo systemctl restart pulp_workers the process does (verified by the pids changing) but the worker that starts does not start processing tasks. The messages are ready for the worker in their .dq queue:

  queue                                         dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =============================================================================================================================
  reserved_resource_worker-0@dev.dq             Y    Y                 2     2      0    1.95k  1.95k       0         1     2
  resource_manager                              Y                      2    42     40    3.71k  49.6k    45.9k        1     2

Note that if you do the same reproducer, but you don't run sudo systemctl restart pulp_workers and instead you run sudo pkill -9 -f reserved_resource_worker;sudo systemctl stop pulp_workers;sudo systemctl start pulp_workers the worker starts up in a functional state.

  • [Issue C] The systemd file for pulp_workers will not start if the the workers are stopped outside of systemd. Specifically with workers running:
sudo systemctl start pulp_workers  # Start pulp_workers with systemd (you can verify they run with ps -awfux | grep celery)
sudo pkill -9 -f reserved_resource_worker  # Stop the workers outside of systemd
sudo systemctl start pulp_workers  # Start the workers again with systemd
ps -awfux | grep celery # Note that no workers are running

This is an issue with how we chain-load the workers.

#2 Updated by ttereshc over 3 years ago

This issue will be split into 3 issues (as of previous comment) and triaged then.

#3 Updated by ttereshc over 3 years ago

For "Issue C" #2837 was created.

#4 Updated by ttereshc over 3 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone set to 40
  • Severity changed from 2. Medium to 3. High
  • Triaged changed from No to Yes

This issue is triaged as "Issue B".
@dralley will file a new issue for "Issue A".

#6 Updated by dalley over 3 years ago

Issue A has been filed here: https://pulp.plan.io/issues/2861

#7 Updated by mhrivnak over 3 years ago

  • Sprint/Milestone changed from 40 to 41

#8 Updated by dalley over 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley

#9 Updated by mhrivnak over 3 years ago

  • Sprint/Milestone changed from 41 to 42

#10 Updated by ttereshc over 3 years ago

  • Related to Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run added

#11 Updated by mhrivnak over 3 years ago

  • Sprint/Milestone changed from 42 to 43

#12 Updated by jortel@redhat.com about 3 years ago

  • Sprint/Milestone changed from 43 to 44

#13 Updated by dalley about 3 years ago

  • Status changed from ASSIGNED to MODIFIED

#14 Updated by dalley about 3 years ago

  • Status changed from MODIFIED to POST

I accidentally pushed this to pulp/2.14-dev instead of my own Github. Apologies.

PR is here: https://github.com/pulp/pulp/pull/3137

#15 Updated by dalley about 3 years ago

Summary:

Restarting a worker that is currently executing a task will leave that workers in a broken state. This issue can reproduced on both Celery 3.1.x and Celery 4.x, but only while using Qpid as a broker. I was not able to reproduce this issue while using RabbitMQ as a broker, using either version of Celery. I was also not able to reproduce this issue on versions of Pulp prior to 2.13. The means of shutting down the workers also does not appear to matter, e.g. "systemctl restart" and "pkill -9 celery; prestart" both work the same.

Reproduction Steps:

1. Start pulp
2. Begin a task (e.g. sync)
3. While the task is running, restart the pulp worker running the task
4. After the worker has restarted, begin another task
5. Observe that the tasks are perpetually stuck in waiting

(Exact steps)

1. prestart
2. pulp-admin rpm repo sync run --repo-id zoo

(While the task is still running, in another terminal)

3a. sudo systemctl restart pulp_workers
OR
3b. pkill -9 celery; prestart

4. pulp-admin rpm repo sync run --repo-id zoo
5. Observe that the task is perpetually stuck in waiting

Symptoms:

Only the worker which was running a task at the time of the worker restart is rendered frozen. Future work assigned to this worker will not be executed - this can cause it to seem as though Pulp is entirely frozen, because work is typically assigned to worker zero. Other workers are fine.

From the CLI:

  [vagrant@pulp2 ~]$ pulp-admin rpm repo sync run --repo-id zoo
  +----------------------------------------------------------------------+
                       Synchronizing Repository [zoo]
  +----------------------------------------------------------------------+

  This command may be exited via ctrl+c without affecting the request.

  Downloading metadata...
  [\]
  ... completed

  Downloading repository content...
  [-]
  [==================================================] 100%
  RPMs:       32/32 items
  Delta RPMs: 0/0 items

  Task Canceled

  [vagrant@pulp2 ~]$ pulp-admin rpm repo sync run --repo-id zoo
  +----------------------------------------------------------------------+
                       Synchronizing Repository [zoo]
  +----------------------------------------------------------------------+

  This command may be exited via ctrl+c without affecting the request.

  [\]
  Waiting to begin...

We can see that Pulp worker is not consuming work from it's associated dedicated queue - the messages sent to the queue by dispatching a new task are not being read out of the queue by the worker.

State of queues before running any tasks


  queue                                               dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ===================================================================================================================================
  reserved_resource_worker-0@pulp2.dev.dq             Y    Y                 0     0      0       0      0        0         1     2
  resource_manager                                    Y                      0    36     36       0   43.1k    43.1k        1     2

State of queues after restarting worker mid-task


  queue                                               dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ===================================================================================================================================
  reserved_resource_worker-0@pulp2.dev.dq             Y    Y                 0     0      0       0      0        0         1     2
  resource_manager                                    Y                      0    40     40       0   47.8k    47.8k        1     2

State of queues after dispatching a new task to the frozen worker

  queue                                               dur  autoDel  excl  msg   msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ===================================================================================================================================
  reserved_resource_worker-0@pulp2.dev.dq             Y    Y                 2     2      0    1.93k  1.93k       0         1     2
  resource_manager                                    Y                      0    41     41       0   49.0k    49.0k        1     2

If you restart the worker immediately after restarting it mid-task, without assigning any new work to the queue, it will restart cleanly and will continue on accepting work normally.

How to "fix":

List all of the tasks in waiting with 'pulp-admin tasks list', cancel them all with 'pulp-admin tasks list cancel', and then restart the worker

  • If you cancel the new task without restarting the worker and then issue a new task, that task will become hung just the same
  • If you restart the worker without cancelling a current hung task, the task will be marked cancelled on reboot of the worker, but future work dispatched to the worker will remain hung
  • If you do not assign new work to the worker, then restart the worker, it will continue accepting work normally

Origin of this issue:

A change was introduced in Pulp 2.13 which used of Celery's Bootsteps feature as a means for workers to write their own heartbeats to the database. This appears to have triggered the conditions to make this bug emergent.

That PR is here: https://github.com/pulp/pulp/pull/2922

After this change, when the function we registered to be executed on the celeryd_after_setup signal is run, the call to _delete_worker() -> cancel() -> controller.revoke() appears cause the worker to become frozen, whereas previously it had worked fine.

https://github.com/pulp/pulp/blob/master/server/pulp/server/async/tasks.py#L663
https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L158

The exact reason that using bootsteps causes the call to revoke() to put the worker in a bad state is still undetermined.

Mitigation:

My current pull request [0] to Pulp can mitigate this issue by registering a new terminate() function on the Celery consumer (called as part of the shutdown process of the worker) which calls _delete_worker() itself. This only fixes the case where the workers are shutdown "kindly", but this is going to be the most common case, and the fix definitely does work 100% for that case.

[0] https://github.com/pulp/pulp/pull/3137/files

Pulp already code which was supposed to accomplish this using the worker_shutdown signal [1], but upon testing I found that while the signal was being called correctly, the qpid.messaging thread had already been stopped at that point, and the resulting traceback prevented the code from reaching the portion of _delete_worker() where the Pulp task is marked cancelled. Therefore, when the worker restarts, it attempts to clean up those workers, calling revoke() and getting frozen up.

[1] https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L171

This approach fixes the issue in the case where the workers are shut down "nicely" aka with SIGTERM. An expanded fix also solves the "pkill" case by moving the pulp worker cleanup and initialization code from the celeryd_after_setup signal handler to the start() method on the Consumer bootstep. However, I have no early idea WHY this works, or why having that code celeryd_after_setup broke when we started using bootsteps in the first place. So I'm not comfortable pushing that expanded fix until those questions are answered.

Debugging:

Attached is the output of running "thread apply all py-bt" on GDB coredumps of the parent and child Celery processes in 4 states.

nowork_* => Clean processes before any work whatsoever has been dispatched
postsync_* => Processes after work has been dispatched normally, without restarting the workers mid-execution
restarted_* => Processes after the pulp workers have been restarted mid-execution of a task
hung_* => Processes after being restarted and having new work dispatched, which are now hung

All 4 parent dumps are identical. The two "clean" child dumps, nowork_child.txt and postsync_child.txt, are also identical to each other, as are the two "dirty" child dumps, hung_child.txt and restarted_child.txt. So, for the sake of comparing differences between the dumps, we only need to look at two - which may as well be nowork_child.txt and hung_child.txt

#16 Updated by dalley about 3 years ago

The migitation patch was applied in PR https://github.com/pulp/pulp/pull/3137/, so now we only need to worry about the case where workers get killed unceremoniously (i.e. OOM, pkill, etc.)

#17 Updated by dalley about 3 years ago

The aforementioned "full workaround" I found by tinkering, but which I don't understand how it works, is here:

https://github.com/pulp/pulp/compare/master...dralley:2835-hack-fix

I consider this to be hacky... this doesn't address why the introduction of bootsteps broke previously-functional code within the celeryd_after_setup signal handler.

I've attached two additional logs, where I started some minimal celery workers which are structurally identical to the pulp workers, and watched their startup sequences.

test_worker1 (similar to the pre-2.13 pulp worker code)

from celery import Celery
from celery.signals import celeryd_after_setup

@celeryd_after_setup.connect
def test_signal(sender, instance, **kwargs):
    print('{0!r} celeryd_after_setup signal fired'.format(sender))

app = Celery(broker='amqp://')

test_worker2 (similar to the 2.13+ pulp_worker code)

from celery import Celery
from celery import bootsteps
from celery.signals import celeryd_after_setup

class Reproducer(bootsteps.StartStopStep):
    def __init__(self, parent, **kwargs):
        print('{0!r} is in init'.format(parent))

    def start(self, worker):
        print('{0!r} is starting up'.format(worker))

        self.timer_ref = worker.timer.call_repeatedly(
            5,
            self.do_work,
            (worker, ),
            priority=10,
        )

    def do_work(self, worker):
        print('{0!r} heartbeat'.format(worker))

    def stop(self, parent):
        print('{0!r} is stopping'.format(parent))

    def shutdown(self, parent):
        print('{0!r} is shutting down'.format(parent))

@celeryd_after_setup.connect
def test_signal(sender, instance, **kwargs):
    print('{0!r} celeryd_after_setup signal fired'.format(sender))

app = Celery(broker='amqp://')
app.steps['consumer'].add(Reproducer)

There are differences in the logs, but I don't see those differences as being meaningful, since they all occur after the point that we care about.

#18 Updated by mhrivnak about 3 years ago

  • Sprint/Milestone changed from 44 to 45

#19 Updated by jortel@redhat.com about 3 years ago

  • Sprint/Milestone changed from 45 to 46

#20 Updated by mhrivnak about 3 years ago

  • Sprint/Milestone changed from 46 to 47

#21 Updated by dalley about 3 years ago

I made one other discovery:

I wanted to see what would happen if I attached the heartbeat bootstep to the worker as it had been originally, before I made the fix in this PR: https://github.com/pulp/pulp/pull/2984

It works! However, that fix was made for a reason - we needed the heartbeats to stop when the broker connection was lost, and restart when it was regained. The celery "worker" (back to overlapping terms.. this is the "worker" component not the broad concept of a worker) has no knowledge of the state of the broker, so the functionality would need to be attached to the celery consumer component.

https://github.com/pulp/pulp/pull/2984/files#diff-ac9a188d0b9425fa260a49c7def6aa0fL124

So, reverting the change from that PR fixes #2835 but breaks reconnect support. Hopefully this helps us narrow down the cause further.

To reproduce this, check out commit 3a3b5f020eca1d019f51301ffe5d9bc2dbffcdb2 on the pulp repo (one commit prior to the PR in question)

#22 Updated by rchan about 3 years ago

  • Sprint/Milestone changed from 47 to 48

#25 Updated by dalley almost 3 years ago

  • Status changed from POST to ASSIGNED

#26 Updated by rchan almost 3 years ago

  • Sprint/Milestone changed from 48 to 52

#27 Updated by rchan almost 3 years ago

  • Sprint/Milestone changed from 52 to 53

#28 Updated by dalley almost 3 years ago

  • Status changed from ASSIGNED to POST

#29 Updated by dalley almost 3 years ago

  • Status changed from POST to MODIFIED

#30 Updated by bmbouter almost 3 years ago

  • Platform Release set to 2.15.2

#32 Updated by daviddavis almost 3 years ago

  • Status changed from MODIFIED to 5

#33 Updated by pthomas@redhat.com almost 3 years ago

Tested this by following the steps from #15

#34 Updated by bmbouter almost 3 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE

#36 Updated by bmbouter over 2 years ago

  • Sprint set to Sprint 31

#37 Updated by bmbouter over 2 years ago

  • Sprint/Milestone deleted (53)

#38 Updated by bmbouter over 1 year ago

  • Tags Pulp 2 added

Please register to edit this issue

Also available in: Atom PDF