Pulp - Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl
https://pulp.plan.io/issues/2861

dalley (dalley@redhat.com), 2017-06-29T13:03:54Z:
Description updated.

bmbouter (bmbouter@redhat.com), 2017-06-29T15:58:14Z:
I believe the root cause of this is in Celery code. The next step is to prove that by making a simple RabbitMQ + Celery 4.0.z reproducer. The reproducer can use a task like:
<pre><code>from celery import Task ## not 100% sure on this line, but you get the idea
@task(base=Task, acks_late=True)
def dummy_task(name, inner_task_id, resource_id, inner_args, inner_kwargs, options):
import time
time.sleep(600) # sleep for five minutes
</code></pre>
The key is to have a worker handling a task with acks_late=True when it is restarted. We also want to remove systemd from the reproducer entirely. Ultimately systemd just sends signals to the process, so you can build a cleaner reproducer by sending those signals to the Celery worker directly.
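For example, a minimal sketch of taking systemd out of the picture, assuming dummy_task from the snippet above has been queued and a worker is currently executing it (the PID is a placeholder you would look up yourself):

<pre><code>import os
import signal

worker_pid = 12345  # hypothetical PID of the Celery worker's parent process

# SIGTERM asks Celery for a warm shutdown, which is also the first signal
# systemd sends on "systemctl stop" / "systemctl restart"
os.kill(worker_pid, signal.SIGTERM)

# systemd falls back to SIGKILL (cold shutdown) after its stop timeout:
# os.kill(worker_pid, signal.SIGKILL)
</code></pre>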

bmbouter (bmbouter@redhat.com), 2017-06-29T15:58:52Z:
Oh, also: this should be added to the sprint. It's a reliability issue that will affect both Pulp 2 and Pulp 3.

Ichimonji10 (jerebear@protonmail.com), 2017-06-29T18:55:15Z:
Description updated.

ttereshc (ttereshc@redhat.com), 2017-06-30T14:53:39Z:
Priority changed from Normal to High
Sprint/Milestone set to 40
Triaged changed from No to Yes

mhrivnak (mhrivnak@redhat.com), 2017-07-03T12:37:59Z:
Sprint/Milestone changed from 40 to 41

daviddavis, 2017-07-06T19:25:54Z:
Status changed from NEW to ASSIGNED
Assignee set to daviddavis

daviddavis, 2017-07-13T14:41:13Z:
FWIW, I'm seeing the queue get emptied as well when I just run systemctl stop:
<pre><code>[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
8d7f6fbb-a788-4025-9370-264eef9214aa:1.0 Y Y 0 4 4 0 2.42k 2.42k 1 2
9b04afc7-3f8a-46b4-8eae-0d1b7d5a0947:1.0 Y Y 0 2 2 0 486 486 1 2
c959209f-4f80-4690-9a8d-45a326f6ffa6:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 2 98 96 1.67k 81.4k 79.7k 0 2
f2768b43-411b-40b9-9277-ec1770c1555c:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
fdbde372-02dd-4a27-b8d1-f394903cfd9f:1.0 Y Y 0 8 8 0 4.88k 4.88k 1 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 1 4 3 1.19k 4.76k 3.58k 0 2
[vagrant@pulp2 ~]$ sudo systemctl stop pulp_resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
8d7f6fbb-a788-4025-9370-264eef9214aa:1.0 Y Y 0 4 4 0 2.42k 2.42k 1 2
97b59cc7-1398-48cf-99eb-6f2a14810dec:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 2 98 96 1.67k 81.4k 79.7k 0 2
f2768b43-411b-40b9-9277-ec1770c1555c:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 0 4 4 0 4.76k 4.76k 0 2
</code></pre>

daviddavis, 2017-07-13T15:51:44Z:
Looks like using "sudo pkill -f resource_manager" causes the queue to become empty:
<pre><code>[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
54271755-b7f0-4587-b801-2864083ba242:0.0 Y Y 0 0 0 0 0 0 1 2
887d958d-49d5-404f-8820-37d8abae3245:1.0 Y Y 0 2 2 0 486 486 1 2
a560e0d7-e702-4967-9ebf-9ea20922688d:1.0 Y Y 0 8 8 0 4.88k 4.88k 1 2
a560e0d7-e702-4967-9ebf-9ea20922688d:2.0 Y Y 0 4 4 0 2.47k 2.47k 1 2
celery Y 0 0 0 0 0 0 0 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 1 2 1 1.19k 2.37k 1.19k 1 2
resource_manager@pulp2.dev.celery.pidbox Y 0 0 0 0 0 0 1 2
resource_manager@pulp2.dev.dq Y Y 0 0 0 0 0 0 1 2
[vagrant@pulp2 ~]$ sudo pkill -f resource_manager
[vagrant@pulp2 ~]$ qpid-stat -q
Queues
queue dur autoDel excl msg msgIn msgOut bytes bytesIn bytesOut cons bind
=========================================================================================================================
4b77caca-2e9f-4e59-9f59-d81f8a2e4b2b:1.0 Y Y 0 4 4 0 2.46k 2.46k 1 2
c45eaa67-b86c-4267-93b9-f23a67b0c3c5:0.0 Y Y 0 0 0 0 0 0 1 2
celery Y 0 0 0 0 0 0 0 2
pulp.task Y 0 0 0 0 0 0 3 1
resource_manager Y 0 2 2 0 2.37k 2.37k 0 2
</code></pre>
This corresponds to what we're seeing with systemctl stop, which sends SIGTERM, waits for a period of time, and then sends SIGKILL. It looks like the issue is with the SIGTERM.

daviddavis, 2017-07-17T16:43:31Z:
After doing some Googling, I found this particular bug:

https://github.com/celery/celery/issues/3057

While debugging, I found that if I comment out this section, restarting pulp_resource_manager doesn't wipe the queue:

https://github.com/celery/celery/blob/51a494019e863188b39f86aec79e23305ba97311/celery/worker/job.py#L440-L442

This seems to line up with what @ask says: when the resource_manager is stopped via a warm shutdown, it acknowledges any queued tasks. This doesn't happen during a cold shutdown (kill -9).

Not sure how to solve this.
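Roughly, the behavior described above boils down to something like the following (a paraphrased illustration, not the actual Celery 3 source; see the job.py lines linked above for the real code):

<pre><code># paraphrased illustration of the Celery 3 failure path discussed above
def on_failure(self, exc_info):
    # ... retry handling, result storage, event sending ...
    # even when the failure is a WorkerLostError raised during a warm shutdown,
    # a task declared with acks_late=True is still acknowledged here, so the
    # broker drops the message and the queued work is lost
    if self.task.acks_late:
        self.acknowledge()
</code></pre>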

daviddavis, 2017-07-17T16:56:12Z:
Looks like they've solved this problem in Celery 4 by adding an option to reject and then requeue the task when the worker is killed:

http://docs.celeryproject.org/en/latest/userguide/configuration.html#task-reject-on-worker-lost

Here's the code:

https://github.com/celery/celery/blob/199cf69f98f3aa655fd9ccd59a09d22de2716b2d/celery/worker/request.py#L368-L379

daviddavis, 2017-07-18T19:31:52Z:
So the decision was made to fix this bug for Celery 4. However, I am unable to reproduce this bug on Celery 4.0.2. I suspect that Celery added a different code path to properly handle a warm shutdown.

This may or may not be related to this bug:

https://github.com/celery/celery/issues/3802

I am tempted to close this bug out since it only affects Celery 3 and it only drops the first message in the queue (the one being processed).

mhrivnak (mhrivnak@redhat.com), 2017-07-23T21:41:14Z:
Sprint/Milestone changed from 41 to 42

daviddavis, 2017-07-25T15:21:02Z:
The next step is to confirm that the request is getting acknowledged on warm shutdown for Celery 4. It should be. Test against both Qpid and RabbitMQ.

Also, we should be setting task_reject_on_worker_lost to true for Celery 4 in Pulp, but only for the resource manager (a sketch is at the end of this comment).

Lastly, we need to fix this for Pulp 2 AND Pulp 3.
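Celery 4 also exposes the option per task, which is one way to limit it to the resource manager; a minimal sketch with a hypothetical task standing in for the resource manager's queueing task (not the actual Pulp code):

<pre><code># hypothetical stand-in for the resource manager task, not the actual Pulp code
@app.task(acks_late=True, reject_on_worker_lost=True)
def queue_reserved_task(name, inner_task_id, resource_id, inner_args, inner_kwargs, options):
    ...  # dispatch the inner task to a worker queue once its resource lock is available
</code></pre>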

ttereshc (ttereshc@redhat.com), 2017-07-31T13:14:02Z:
Related issue added: Issue #2734 - task group never quite finishes if pulp is restarted in the middle of task run (https://pulp.plan.io/issues/2734)

daviddavis, 2017-08-02T15:50:13Z:
Digging into Celery 4 with RabbitMQ, it looks like it's hitting on_failure in request.py with a WorkerLostError when I restart pulp_resource_manager:
<pre><code>> /usr/lib/python2.7/site-packages/celery/worker/request.py(377)on_failure()
-> if reject:
(Pdb) exc
WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
(Pdb) type(exc)
<class 'billiard.exceptions.WorkerLostError'>
</code></pre>
From debugging, it's calling acknowledge() [1]. However, I'm still stumped as to why the queue isn't being cleared like it is in Celery 3.

Also, one other observation: if I start up pulp_workers afterwards, the sync task gets processed.

[1] https://github.com/celery/celery/blob/87b263bcea88756d870d19f27af9cb54c6f860cf/celery/worker/request.py#L379

daviddavis, 2017-08-02T21:24:02Z:
Status changed from ASSIGNED to CLOSED - WONTFIX

To fix this for Celery 3, we'd need to carry a patch for Celery and I don't think we want to do that. I am going to close this.
I've opened a separate task for Celery 4:

https://pulp.plan.io/issues/2954

daviddavis, 2017-08-02T21:26:02Z:
Related issue added: Issue #2954 - Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4 (https://pulp.plan.io/issues/2954)

bmbouter (bmbouter@redhat.com), 2018-03-08T23:21:16Z:
Sprint set to Sprint 23

bmbouter (bmbouter@redhat.com), 2018-03-08T23:21:41Z:
Sprint/Milestone deleted (was 42)

bmbouter (bmbouter@redhat.com), 2019-04-15T20:17:03Z:
Tags: Pulp 2 added