Pulp - Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl
https://pulp.plan.io/issues/2861

dalley (dalley@redhat.com), 2017-06-29T13:03:54Z:
Description updated.

bmbouter (bmbouter@redhat.com), 2017-06-29T15:58:14Z:
I believe the root cause of this is in Celery code. The next step is to prove that by making a simple RabbitMQ + Celery 4.0.z reproducer. The reproducer can use a task like:
<pre><code>from celery import Task ## not 100% sure on this line, but you get the idea
@task(base=Task, acks_late=True)
def dummy_task(name, inner_task_id, resource_id, inner_args, inner_kwargs, options):
import time
time.sleep(600) # sleep for five minutes
</code></pre>
The key is to have a worker handling a task with acks_late=True when it is restarted. We also want to remove systemd from the reproducer entirely. Ultimately systemd just sends signals to the process, so you can build a cleaner reproducer by sending those signals to the Celery worker directly.
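For example, a minimal sketch of taking systemd out of the picture, assuming dummy_task from the snippet above has been queued and a worker is currently executing it (the PID is a placeholder you would look up yourself):

<pre><code>import os
import signal

worker_pid = 12345  # hypothetical PID of the Celery worker's parent process

# SIGTERM asks Celery for a warm shutdown, which is also the first signal
# systemd sends on "systemctl stop" / "systemctl restart"
os.kill(worker_pid, signal.SIGTERM)

# systemd falls back to SIGKILL (cold shutdown) after its stop timeout:
# os.kill(worker_pid, signal.SIGKILL)
</code></pre>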

bmbouter (bmbouter@redhat.com), 2017-06-29T15:58:52Z:
Oh, also: this should be added to the sprint. It's a reliability issue that will affect both Pulp 2 and Pulp 3.

Ichimonji10 (jerebear@protonmail.com), 2017-06-29T18:55:15Z:
Description updated.

ttereshc (ttereshc@redhat.com), 2017-06-30T14:53:39Z:
Priority changed from Normal to High
Sprint/Milestone set to 40
Triaged changed from No to Yes

mhrivnak (mhrivnak@redhat.com), 2017-07-03T12:37:59Z:
Sprint/Milestone changed from 40 to 41

daviddavis, 2017-07-06T19:25:54Z:
Status changed from NEW to ASSIGNED
Assignee set to daviddavis

daviddavis, 2017-07-13T14:41:13Z:
FWIW, I'm seeing the queue get emptied as well when I just run systemctl stop:
<pre><code>[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
8d7f6fbb-a788-4025-9370-264eef9214aa:1.0 Y Y 0 4 4 0 2.42k 2.42k 1 2
9b04afc7-3f8a-46b4-8eae-0d1b7d5a0947:1.0 Y Y 0 2 2 0 486 486 1 2
c959209f-4f80-4690-9a8d-45a326f6ffa6:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 2 98 96 1.67k 81.4k 79.7k 0 2
f2768b43-411b-40b9-9277-ec1770c1555c:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
fdbde372-02dd-4a27-b8d1-f394903cfd9f:1.0 Y Y 0 8 8 0 4.88k 4.88k 1 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 1 4 3 1.19k 4.76k 3.58k 0 2
[vagrant@pulp2 ~]$ sudo systemctl stop pulp_resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
8d7f6fbb-a788-4025-9370-264eef9214aa:1.0 Y Y 0 4 4 0 2.42k 2.42k 1 2
97b59cc7-1398-48cf-99eb-6f2a14810dec:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 2 98 96 1.67k 81.4k 79.7k 0 2
f2768b43-411b-40b9-9277-ec1770c1555c:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 0 4 4 0 4.76k 4.76k 0 2
</code></pre>

daviddavis, 2017-07-13T15:51:44Z:
Looks like using "sudo pkill -f resource_manager" causes the queue to become empty:
<pre><code>[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
54271755-b7f0-4587-b801-2864083ba242:0.0 Y Y 0 0 0 0 0 0 1 2
887d958d-49d5-404f-8820-37d8abae3245:1.0 Y Y 0 2 2 0 486 486 1 2
a560e0d7-e702-4967-9ebf-9ea20922688d:1.0 Y Y 0 8 8 0 4.88k 4.88k 1 2
a560e0d7-e702-4967-9ebf-9ea20922688d:2.0 Y Y 0 4 4 0 2.47k 2.47k 1 2
celery Y 0 0 0 0 0 0 0 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 1 2 1 1.19k 2.37k 1.19k 1 2
resource_manager@pulp2.dev.celery.pidbox Y 0 0 0 0 0 0 1 2
resource_manager@pulp2.dev.dq Y Y 0 0 0 0 0 0 1 2
[vagrant@pulp2 ~]$ sudo pkill -f resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
c45eaa67-b86c-4267-93b9-f23a67b0c3c5:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 0 0 0 0 0 0 0 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 0 2 2 0 2.37k 2.37k 0 2
</code></pre>
This corresponds to what we're seeing with systemctl stop, which sends SIGTERM, waits for a period of time, and then sends SIGKILL. It looks like the issue is with the SIGTERM.

daviddavis, 2017-07-17T16:43:31Z:
After doing some Googling, I found this particular bug:

https://github.com/celery/celery/issues/3057

While debugging, I found that if I comment out this section, restarting pulp_resource_manager doesn't wipe the queue:

https://github.com/celery/celery/blob/51a494019e863188b39f86aec79e23305ba97311/celery/worker/job.py#L440-L442

This seems to line up with what @ask says: when the resource_manager is stopped via a warm shutdown, it acknowledges any queued tasks. This doesn't happen during a cold shutdown (kill -9).

Not sure how to solve this.
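Roughly, the behavior described above boils down to something like the following (a paraphrased illustration, not the actual Celery 3 source; see the job.py lines linked above for the real code):

<pre><code># paraphrased illustration of the Celery 3 failure path discussed above
def on_failure(self, exc_info):
    # ... retry handling, result storage, event sending ...
    # even when the failure is a WorkerLostError raised during a warm shutdown,
    # a task declared with acks_late=True is still acknowledged here, so the
    # broker drops the message and the queued work is lost
    if self.task.acks_late:
        self.acknowledge()
</code></pre>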

daviddavis, 2017-07-17T16:56:12Z:
Looks like they've solved this problem in Celery 4 by adding an option to reject and then requeue the task when the worker is killed:

http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-reject-on-worker-lost

Here's the code:

https://github.com/celery/celery/blob/199cf69f98f3aa655fd9ccd59a09d22de2716b2d/celery/worker/request.py#L368-L379

daviddavis, 2017-07-18T19:31:52Z:
So the decision was made to fix this bug for Celery 4. However, I am unable to reproduce this bug on Celery 4.0.2. I suspect that Celery added a different code path to properly handle a warm shutdown.

This may or may not be related to this bug:

https://github.com/celery/celery/issues/3802

I am tempted to close this bug out since it only affects Celery 3 and it only drops the first message in the queue (the one being processed).

mhrivnak (mhrivnak@redhat.com), 2017-07-23T21:41:14Z:
Sprint/Milestone changed from 41 to 42

daviddavis, 2017-07-25T15:21:02Z:
The next step is to confirm that the request is getting acknowledged on warm shutdown for Celery 4. It should be. Test against both Qpid and RabbitMQ.

Also, we should be setting task_reject_on_worker_lost to true for Celery 4 in Pulp, but only for the resource manager (a sketch is at the end of this comment).

Lastly, we need to fix this for Pulp 2 AND Pulp 3.
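Celery 4 also exposes the option per task, which is one way to limit it to the resource manager; a minimal sketch with a hypothetical task standing in for the resource manager's queueing task (not the actual Pulp code):

<pre><code># hypothetical stand-in for the resource manager task, not the actual Pulp code
@app.task(acks_late=True, reject_on_worker_lost=True)
def queue_reserved_task(name, inner_task_id, resource_id, inner_args, inner_kwargs, options):
    ...  # dispatch the inner task to a worker queue once its resource lock is available
</code></pre>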

ttereshc (ttereshc@redhat.com), 2017-07-31T13:14:02Z:
Related issue added: Issue #2734 - task group never quite finishes if pulp is restarted in the middle of task run (https://pulp.plan.io/issues/2734)

daviddavis, 2017-08-02T15:50:13Z:
Digging into Celery 4 with RabbitMQ, it looks like it's hitting on_failure in request.py with a WorkerLostError when I restart pulp_resource_manager:
<pre><code>> /usr/lib/python2.7/site-packages/celery/worker/request.py(377)on_failure()
-> if reject:
(Pdb) exc
WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
(Pdb) type(exc)
<class 'billiard.exceptions.WorkerLostError'>
</code></pre>
From debugging, it's calling acknowledge() [1]. However, I'm still stumped as to why the queue isn't being cleared like it is in Celery 3.

Also, one other observation: if I start up pulp_workers afterwards, the sync task gets processed.

[1] https://github.com/celery/celery/blob/87b263bcea88756d870d19f27af9cb54c6f860cf/celery/worker/request.py#L379

daviddavis, 2017-08-02T21:24:02Z:
Status changed from ASSIGNED to CLOSED - WONTFIX

To fix this for Celery 3, we'd need to carry a patch for Celery and I don't think we want to do that. I am going to close this.
I've opened a separate task for Celery 4:

https://pulp.plan.io/issues/2954

daviddavis, 2017-08-02T21:26:02Z:
Related issue added: Issue #2954 - Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4 (https://pulp.plan.io/issues/2954)

bmbouter (bmbouter@redhat.com), 2018-03-08T23:21:16Z:
Sprint set to Sprint 23

bmbouter (bmbouter@redhat.com), 2018-03-08T23:21:41Z:
Sprint/Milestone deleted (was 42)

bmbouter (bmbouter@redhat.com), 2019-04-15T20:17:03Z:
Tags: Pulp 2 added