Pulp - Issue #2835: Tasks stuck in waiting after restart of pulp services

Comment by bmbouter@redhat.com, 2017-06-23T14:17:20Z (https://pulp.plan.io/issues/2835?journal_id=20362)
<p>I was able to reproduce the issue. Upon further investigation I can see at least three issues manifesting with this one reproducer...</p>
<ul>
<li>[Issue A] The resource_manager queue loses a currently running <a href="https://github.com/pulp/pulp/blob/232e5d66405147a23e8e7a071468199e1537d4a0/server/pulp/server/async/tasks.py#L111" class="external">_queue_reserved_task</a> if the resource manager is restarted with <code>sudo systemctl restart pulp_resource_manager</code>. The task is lost from the queue, but its TaskStatus record incorrectly remains in the waiting state, so it will never run. Note that if you <code>sudo pkill -9 -f resource_manager</code> and then <code>sudo systemctl start pulp_resource_manager</code>, it does <strong>not</strong> lose the task.</li>
</ul>
<ul>
<li>[Issue B] Restarting <code>pulp_workers</code> while a worker is processing a task leaves the worker in a broken-forever state. The reproducer leaves 1 task running on the worker (with a _release_resource task still in the worker's .dq queue) and the 3 publishes left on the resource manager:</li>
</ul>
<pre><code>  queue                              dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ===============================================================================================================
  reserved_resource_worker-0@dev.dq  Y    Y              1    2      1       931    1.90k    967       1     2
  resource_manager                   Y                   3    42     39      3.71k  49.6k    45.9k     1     2
</code></pre>
<p>After running <code>sudo systemctl restart pulp_workers</code>, the worker process does restart (verified by the PIDs changing), but the worker that starts does not start processing tasks. The messages sit ready for the worker in its .dq queue:</p>
<pre><code>  queue                              dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  ===============================================================================================================
  reserved_resource_worker-0@dev.dq  Y    Y              2    2      0       1.95k  1.95k    0         1     2
  resource_manager                   Y                   2    42     40      3.71k  49.6k    45.9k     1     2
</code></pre>
<p>Note that if you follow the same reproducer, but instead of running <code>sudo systemctl restart pulp_workers</code> you run <code>sudo pkill -9 -f reserved_resource_worker; sudo systemctl stop pulp_workers; sudo systemctl start pulp_workers</code>, the worker starts up in a functional state.</p>
<ul>
<li>[Issue C] The systemd unit for pulp_workers will not start the workers if they were stopped outside of systemd. Specifically, with workers running:</li>
</ul>
<pre><code>sudo systemctl start pulp_workers # Start pulp_workers with systemd (you can verify they run with ps -awfux | grep celery)
sudo pkill -9 -f reserved_resource_worker # Stop the workers outside of systemd
sudo systemctl start pulp_workers # Start the workers again with systemd
ps -awfux | grep celery # Note that no workers are running
</code></pre>
<p>This is an issue with how we chain-load the workers.</p>

Comment by ttereshc@redhat.com, 2017-06-23T15:12:52Z (https://pulp.plan.io/issues/2835?journal_id=20370)
<p>This issue will be split into 3 issues (as described in the previous comment) and they will be triaged then.</p>

Comment by ttereshc@redhat.com, 2017-06-27T13:14:37Z (https://pulp.plan.io/issues/2835?journal_id=20429)
<ul></ul><p>For "Issue C" <a class="issue tracker-1 status-9 priority-6 priority-default closed" title="Issue: The systemd file for pulp_workers will not start the workers if they have been killed without usi... (CLOSED - WONTFIX)" href="https://pulp.plan.io/issues/2837">#2837</a> was created.</p> Pulp - Issue #2835: Tasks stuck in waiting after restart of pulp serviceshttps://pulp.plan.io/issues/2835?journal_id=204322017-06-27T14:40:26Zttereshcttereshc@redhat.com
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li><li><strong>Sprint/Milestone</strong> set to <i>40</i></li><li><strong>Severity</strong> changed from <i>2. Medium</i> to <i>3. High</i></li><li><strong>Triaged</strong> changed from <i>No</i> to <i>Yes</i></li></ul><p>This issue is triaged as "Issue B".<br>
@dralley will file a new issue for "Issue A".</p>

Comment by dalley@redhat.com, 2017-06-29T14:20:02Z (https://pulp.plan.io/issues/2835?journal_id=20497)
<p>Issue A has been filed here: <a href="https://pulp.plan.io/issues/2861" class="external">https://pulp.plan.io/issues/2861</a></p>

Comment by mhrivnak@redhat.com, 2017-07-03T12:37:58Z (https://pulp.plan.io/issues/2835?journal_id=20573)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>40</i> to <i>41</i></li></ul>

Comment by dalley@redhat.com, 2017-07-12T14:18:23Z (https://pulp.plan.io/issues/2835?journal_id=20767)
<ul><li><strong>Status</strong> changed from <i>NEW</i> to <i>ASSIGNED</i></li><li><strong>Assignee</strong> set to <i>dalley</i></li></ul>

Comment by mhrivnak@redhat.com, 2017-07-23T21:41:12Z (https://pulp.plan.io/issues/2835?journal_id=21167)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>41</i> to <i>42</i></li></ul>

Comment by ttereshc@redhat.com, 2017-07-31T13:13:32Z (https://pulp.plan.io/issues/2835?journal_id=21285)
<ul><li><strong>Related to</strong> <i><a href="/issues/2734">Issue #2734</a>: task group never quite finishes if pulp is restarted in the middle of task run</i> added</li></ul>

Comment by mhrivnak@redhat.com, 2017-08-14T14:00:21Z (https://pulp.plan.io/issues/2835?journal_id=21550)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>42</i> to <i>43</i></li></ul>

Comment by jortel@redhat.com, 2017-09-05T16:04:47Z (https://pulp.plan.io/issues/2835?journal_id=21784)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>43</i> to <i>44</i></li></ul>

Comment by dalley@redhat.com, 2017-09-09T21:09:33Z (https://pulp.plan.io/issues/2835?journal_id=21838)
<ul><li><strong>Status</strong> changed from <i>ASSIGNED</i> to <i>MODIFIED</i></li></ul><p>Applied in changeset <a class="changeset" title="Fix issue causing worker to be left broken forever Fixes an issue where workers can be left in a..." href="https://pulp.plan.io/projects/pulp/repository/pulp/revisions/881a5fb9fdaf9813d0dbb576ab7ca3c7b3dc8476">pulp|881a5fb9fdaf9813d0dbb576ab7ca3c7b3dc8476</a>.</p>

Comment by dalley@redhat.com, 2017-09-09T21:36:26Z (https://pulp.plan.io/issues/2835?journal_id=21839)
<ul><li><strong>Status</strong> changed from <i>MODIFIED</i> to <i>POST</i></li></ul><p>I accidentally pushed this to pulp/2.14-dev instead of my own GitHub fork. Apologies.</p>
<p>PR is here: <a href="https://github.com/pulp/pulp/pull/3137" class="external">https://github.com/pulp/pulp/pull/3137</a></p>

Comment by dalley@redhat.com, 2017-09-13T04:02:03Z (https://pulp.plan.io/issues/2835?journal_id=21865)
<ul><li><strong>File</strong> <a href="/attachments/333">gdb_process_dumps.zip</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/333/gdb_process_dumps.zip">gdb_process_dumps.zip</a> added</li></ul><a name="Summary"></a>
<h2 >Summary:<a href="#Summary" class="wiki-anchor">¶</a></h2>
<p>Restarting a worker that is currently executing a task will leave that worker in a broken state. This issue can be reproduced on both Celery 3.1.x and Celery 4.x, but only while using Qpid as the broker. I was not able to reproduce this issue while using RabbitMQ as the broker with either version of Celery. I was also not able to reproduce this issue on versions of Pulp prior to 2.13. The means of shutting down the workers does not appear to matter; e.g. "systemctl restart" and "pkill -9 celery; prestart" behave the same.</p>
<a name="Reproduction-Steps"></a>
<h2 >Reproduction Steps:<a href="#Reproduction-Steps" class="wiki-anchor">¶</a></h2>
<p>1. Start pulp<br>
2. Begin a task (e.g. sync)<br>
3. While the task is running, restart the pulp worker running the task<br>
4. After the worker has restarted, begin another task<br>
5. Observe that the tasks are perpetually stuck in waiting</p>
<p>(Exact steps)</p>
<p>1. prestart<br>
2. pulp-admin rpm repo sync run --repo-id zoo</p>
<p>(While the task is still running, in another terminal)</p>
<p>3a. sudo systemctl restart pulp_workers<br>
OR<br>
3b. pkill -9 celery; prestart</p>
<p>4. pulp-admin rpm repo sync run --repo-id zoo<br>
5. Observe that the task is perpetually stuck in waiting</p>
<a name="Symptoms"></a>
<h2 >Symptoms:<a href="#Symptoms" class="wiki-anchor">¶</a></h2>
<p>Only the worker which was running a task at the time of the worker restart is rendered frozen. Future work assigned to this worker will not be executed - this can cause it to seem as though Pulp is entirely frozen, because work is typically assigned to worker zero. Other workers are fine.</p>
<p>From the CLI:</p>
<pre><code> [vagrant@pulp2 ~]$ pulp-admin rpm repo sync run --repo-id zoo
+----------------------------------------------------------------------+
Synchronizing Repository [zoo]
+----------------------------------------------------------------------+
This command may be exited via ctrl+c without affecting the request.
Downloading metadata...
[\]
... completed
Downloading repository content...
[-]
[==================================================] 100%
RPMs: 32/32 items
Delta RPMs: 0/0 items
Task Canceled
[vagrant@pulp2 ~]$ pulp-admin rpm repo sync run --repo-id zoo
+----------------------------------------------------------------------+
Synchronizing Repository [zoo]
+----------------------------------------------------------------------+
This command may be exited via ctrl+c without affecting the request.
[\]
Waiting to begin...
</code></pre>
<p>We can see that the Pulp worker is not consuming work from its associated dedicated queue: the messages sent to the queue by dispatching a new task are not being read out of the queue by the worker.</p>
<p>State of queues before running any tasks</p>
<pre><code>  queue                                    dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =====================================================================================================================
  reserved_resource_worker-0@pulp2.dev.dq  Y    Y              0    0      0       0      0        0         1     2
  resource_manager                         Y                   0    36     36      0      43.1k    43.1k     1     2
</code></pre>
<p>State of queues after restarting worker mid-task</p>
<pre><code>  queue                                    dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =====================================================================================================================
  reserved_resource_worker-0@pulp2.dev.dq  Y    Y              0    0      0       0      0        0         1     2
  resource_manager                         Y                   0    40     40      0      47.8k    47.8k     1     2
</code></pre>
<p>State of queues after dispatching a new task to the frozen worker</p>
<pre><code>  queue                                    dur  autoDel  excl  msg  msgIn  msgOut  bytes  bytesIn  bytesOut  cons  bind
  =====================================================================================================================
  reserved_resource_worker-0@pulp2.dev.dq  Y    Y              2    2      0       1.93k  1.93k    0         1     2
  resource_manager                         Y                   0    41     41      0      49.0k    49.0k     1     2
</code></pre>
<p>If you restart the worker again immediately after the mid-task restart, without assigning any new work to its queue, it will come up cleanly and continue accepting work normally.</p>
<a name="How-to-fix"></a>
<h2 >How to "fix":<a href="#How-to-fix" class="wiki-anchor">¶</a></h2>
<p>List all of the tasks in waiting with 'pulp-admin tasks list', cancel each of them with 'pulp-admin tasks cancel --task-id <id>', and then restart the worker.</p>
<ul>
<li>If you cancel the new task without restarting the worker and then issue a new task, that task will become hung just the same</li>
<li>If you restart the worker without cancelling a current hung task, the task will be marked cancelled on reboot of the worker, but future work dispatched to the worker will remain hung</li>
<li>If you do not assign new work to the worker, then restart the worker, it will continue accepting work normally</li>
</ul>
<a name="Origin-of-this-issue"></a>
<h2 >Origin of this issue:<a href="#Origin-of-this-issue" class="wiki-anchor">¶</a></h2>
<p>A change was introduced in Pulp 2.13 which made use of Celery's bootsteps feature as a means for workers to write their own heartbeats to the database. This appears to have created the conditions that make this bug emerge.</p>
<p>That PR is here: <a href="https://github.com/pulp/pulp/pull/2922" class="external">https://github.com/pulp/pulp/pull/2922</a></p>
<p>After this change, when the function we registered on the celeryd_after_setup signal runs, the call chain _delete_worker() -> cancel() -> controller.revoke() appears to cause the worker to become frozen, whereas previously it had worked fine.</p>
<p><a href="https://github.com/pulp/pulp/blob/master/server/pulp/server/async/tasks.py#L663" class="external">https://github.com/pulp/pulp/blob/master/server/pulp/server/async/tasks.py#L663</a><br>
<a href="https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L158" class="external">https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L158</a></p>
<p>The exact reason that using bootsteps causes the call to revoke() to put the worker in a bad state is still undetermined.</p>
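<p>To make that failing path concrete, here is a minimal, self-contained sketch of the startup sequence described above. The helper names and bodies are hypothetical stand-ins for Pulp's real implementations (linked above); only the call chain _delete_worker() -> cancel() -> revoke() comes from this report.</p>
<pre><code class="python">
from celery import Celery
from celery.signals import celeryd_after_setup

app = Celery(broker='qpid://localhost//')  # hypothetical broker URL


def cancel(task_id):
    # Pulp marks the TaskStatus canceled, then revokes the Celery task.
    # With the heartbeat bootstep installed, this revoke() is the call
    # that appears to leave a freshly restarted worker frozen.
    app.control.revoke(task_id)


def _delete_worker(name):
    # Stand-in: find the task IDs the database still attributes to this
    # worker name and cancel each of them.
    for task_id in []:  # placeholder for the real DB query
        cancel(task_id)


@celeryd_after_setup.connect
def initialize_worker(sender, instance, **kwargs):
    # On startup, clean up the state left behind by this worker's
    # previous incarnation.
    _delete_worker(sender)
</code></pre>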
<a name="Mitigation"></a>
<h2 >Mitigation:<a href="#Mitigation" class="wiki-anchor">¶</a></h2>
<p>My current pull request [0] against Pulp mitigates this issue by registering a new terminate() function on the Celery consumer (called as part of the worker's shutdown process) which calls _delete_worker() itself. This only fixes the case where the workers are shut down "kindly", but that is going to be the most common case, and the fix works reliably for it.</p>
<p>[0] <a href="https://github.com/pulp/pulp/pull/3137/files" class="external">https://github.com/pulp/pulp/pull/3137/files</a></p>
<p>Pulp already had code which was supposed to accomplish this using the worker_shutdown signal [1], but upon testing I found that while the signal was being fired correctly, the qpid.messaging thread had already been stopped at that point, and the resulting traceback prevented the code from reaching the portion of _delete_worker() where the Pulp task is marked canceled. Therefore, when the worker restarts, it attempts to clean up those leftover tasks, calling revoke() and freezing up.</p>
<p>[1] <a href="https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L171" class="external">https://github.com/pulp/pulp/blob/master/server/pulp/server/async/app.py#L171</a></p>
<p>This approach fixes the issue in the case where the workers are shut down "nicely", i.e. with SIGTERM. An expanded fix also solves the "pkill" case by moving the Pulp worker cleanup and initialization code from the celeryd_after_setup signal handler to the start() method on the Consumer bootstep. However, I have no real idea WHY this works, or why having that code in celeryd_after_setup broke when we started using bootsteps in the first place, so I'm not comfortable pushing that expanded fix until those questions are answered.</p>
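<p>The structure of that expanded fix, as described (a hypothetical sketch; the actual diff lives on the 2835-hack-fix branch referenced in a later comment):</p>
<pre><code class="python">
from celery import Celery, bootsteps

app = Celery(broker='qpid://localhost//')  # hypothetical broker URL


def _delete_worker(name):
    print('cleaning up stale state for {0}'.format(name))  # hypothetical stand-in


class PulpWorkerInit(bootsteps.StartStopStep):
    """Runs the cleanup that previously lived in celeryd_after_setup."""

    def start(self, parent):
        # start() runs after the consumer has (re)established its broker
        # connection; empirically this avoids the frozen-worker state,
        # though WHY it works is exactly the open question above.
        _delete_worker(parent.hostname)


app.steps['consumer'].add(PulpWorkerInit)
</code></pre>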
<a name="Debugging"></a>
<h2 >Debugging:<a href="#Debugging" class="wiki-anchor">¶</a></h2>
<p>Attached is the output of running "thread apply all py-bt" on GDB coredumps of the parent and child Celery processes in 4 states.</p>
<p>nowork_* => Clean processes before any work whatsoever has been dispatched<br>
postsync_* => Processes after work has been dispatched normally, without restarting the workers mid-execution<br>
restarted_* => Processes after the pulp workers have been restarted mid-execution of a task<br>
hung_* => Processes after being restarted and having new work dispatched, which are now hung</p>
<p>All 4 parent dumps are identical. The two "clean" child dumps, nowork_child.txt and postsync_child.txt, are also identical to each other, as are the two "dirty" child dumps, hung_child.txt and restarted_child.txt. So, for the sake of comparing differences between the dumps, we only need to look at two - which may as well be nowork_child.txt and hung_child.txt.</p>

Comment by dalley@redhat.com, 2017-09-15T18:11:45Z (https://pulp.plan.io/issues/2835?journal_id=21900)
<p>The mitigation patch was applied in PR <a href="https://github.com/pulp/pulp/pull/3137/" class="external">https://github.com/pulp/pulp/pull/3137/</a>, so now we only need to worry about the case where workers get killed unceremoniously (i.e. OOM, pkill, etc.)</p>

Comment by dalley@redhat.com, 2017-09-19T00:16:53Z (https://pulp.plan.io/issues/2835?journal_id=21913)
<ul><li><strong>File</strong> <a href="/attachments/334">celery_startup_log_bootsteps_and_signal.txt</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/334/celery_startup_log_bootsteps_and_signal.txt">celery_startup_log_bootsteps_and_signal.txt</a> added</li><li><strong>File</strong> <a href="/attachments/335">celery_startup_log_signal_only.txt</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/335/celery_startup_log_signal_only.txt">celery_startup_log_signal_only.txt</a> added</li></ul><p>The aforementioned "full workaround" I found by tinkering, but which I don't understand how it works, is here:</p>
<pre><code>https://github.com/pulp/pulp/compare/master...dralley:2835-hack-fix
</code></pre>
<p>I consider this to be hacky... this doesn't address why the introduction of bootsteps broke previously-functional code within the celeryd_after_setup signal handler.</p>
<p>I've attached two additional logs, where I started some minimal celery workers which are structurally identical to the pulp workers, and watched their startup sequences.</p>
<p>test_worker1 (similar to the pre-2.13 pulp worker code)</p>
<pre><code class="python syntaxhl" data-language="python"><span class="kn">from</span> <span class="nn">celery</span> <span class="kn">import</span> <span class="n">Celery</span>
<span class="kn">from</span> <span class="nn">celery.signals</span> <span class="kn">import</span> <span class="n">celeryd_after_setup</span>
<span class="o">@</span><span class="n">celeryd_after_setup</span><span class="p">.</span><span class="n">connect</span>
<span class="k">def</span> <span class="nf">test_signal</span><span class="p">(</span><span class="n">sender</span><span class="p">,</span> <span class="n">instance</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{0!r} celeryd_after_setup signal fired'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">sender</span><span class="p">))</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">Celery</span><span class="p">(</span><span class="n">broker</span><span class="o">=</span><span class="s">'amqp://'</span><span class="p">)</span>
</code></pre>
<p>test_worker2 (similar to the 2.13+ pulp_worker code)</p>
<pre><code class="python syntaxhl" data-language="python"><span class="kn">from</span> <span class="nn">celery</span> <span class="kn">import</span> <span class="n">Celery</span>
<span class="kn">from</span> <span class="nn">celery</span> <span class="kn">import</span> <span class="n">bootsteps</span>
<span class="kn">from</span> <span class="nn">celery.signals</span> <span class="kn">import</span> <span class="n">celeryd_after_setup</span>
<span class="k">class</span> <span class="nc">Reproducer</span><span class="p">(</span><span class="n">bootsteps</span><span class="p">.</span><span class="n">StartStopStep</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">parent</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{0!r} is in init'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">parent</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">start</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">worker</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{0!r} is starting up'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">worker</span><span class="p">))</span>
<span class="bp">self</span><span class="p">.</span><span class="n">timer_ref</span> <span class="o">=</span> <span class="n">worker</span><span class="p">.</span><span class="n">timer</span><span class="p">.</span><span class="n">call_repeatedly</span><span class="p">(</span>
<span class="mi">5</span><span class="p">,</span>
<span class="bp">self</span><span class="p">.</span><span class="n">do_work</span><span class="p">,</span>
<span class="p">(</span><span class="n">worker</span><span class="p">,</span> <span class="p">),</span>
<span class="n">priority</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">do_work</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">worker</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{0!r} heartbeat'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">worker</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">stop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">parent</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{0!r} is stopping'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">parent</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">shutdown</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">parent</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{0!r} is shutting down'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">parent</span><span class="p">))</span>
<span class="o">@</span><span class="n">celeryd_after_setup</span><span class="p">.</span><span class="n">connect</span>
<span class="k">def</span> <span class="nf">test_signal</span><span class="p">(</span><span class="n">sender</span><span class="p">,</span> <span class="n">instance</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'{0!r} celeryd_after_setup signal fired'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">sender</span><span class="p">))</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">Celery</span><span class="p">(</span><span class="n">broker</span><span class="o">=</span><span class="s">'amqp://'</span><span class="p">)</span>
<span class="n">app</span><span class="p">.</span><span class="n">steps</span><span class="p">[</span><span class="s">'consumer'</span><span class="p">].</span><span class="n">add</span><span class="p">(</span><span class="n">Reproducer</span><span class="p">)</span>
</code></pre>
<p>There are differences in the logs, but I don't see those differences as being meaningful, since they all occur after the point that we care about.</p>

Comment by mhrivnak@redhat.com, 2017-09-25T13:09:16Z (https://pulp.plan.io/issues/2835?journal_id=22021)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>44</i> to <i>45</i></li></ul>

Comment by jortel@redhat.com, 2017-10-18T15:28:22Z (https://pulp.plan.io/issues/2835?journal_id=22434)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>45</i> to <i>46</i></li></ul>

Comment by mhrivnak@redhat.com, 2017-11-06T13:10:58Z (https://pulp.plan.io/issues/2835?journal_id=22749)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>46</i> to <i>47</i></li></ul>

Comment by dalley@redhat.com, 2017-11-10T20:23:27Z (https://pulp.plan.io/issues/2835?journal_id=22841)
<p>I made one other discovery:</p>
<p>I wanted to see what would happen if I attached the heartbeat bootstep to the worker as it had been originally, before I made the fix in this PR: <a href="https://github.com/pulp/pulp/pull/2984" class="external">https://github.com/pulp/pulp/pull/2984</a></p>
<p>It works! However, that fix was made for a reason - we needed the heartbeats to stop when the broker connection was lost and restart when it was regained. The celery "worker" (back to overlapping terms... this is the "worker" component, not the broad concept of a worker) has no knowledge of the state of the broker, so the functionality needs to be attached to the celery consumer component. The sketch after the link below contrasts the two attachment points.</p>
<p><a href="https://github.com/pulp/pulp/pull/2984/files#diff-ac9a188d0b9425fa260a49c7def6aa0fL124" class="external">https://github.com/pulp/pulp/pull/2984/files#diff-ac9a188d0b9425fa260a49c7def6aa0fL124</a></p>
<p>So, reverting the change from that PR fixes <a href="https://pulp.plan.io/issues/2835">#2835</a> but breaks reconnect support. Hopefully this helps us narrow down the cause further.</p>
<p>To reproduce this, check out commit 3a3b5f020eca1d019f51301ffe5d9bc2dbffcdb2 on the pulp repo (one commit prior to the PR in question).</p>

Comment by rchan, 2017-11-30T20:59:06Z (https://pulp.plan.io/issues/2835?journal_id=23093)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>47</i> to <i>48</i></li></ul>

Comment by dalley@redhat.com, 2017-12-18T14:56:58Z (https://pulp.plan.io/issues/2835?journal_id=23420)
<ul><li><strong>Status</strong> changed from <i>POST</i> to <i>ASSIGNED</i></li></ul>

Comment by rchan, 2017-12-19T16:18:00Z (https://pulp.plan.io/issues/2835?journal_id=23506)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>48</i> to <i>52</i></li></ul>

Comment by rchan, 2018-01-08T21:27:07Z (https://pulp.plan.io/issues/2835?journal_id=23751)
<ul><li><strong>Sprint/Milestone</strong> changed from <i>52</i> to <i>53</i></li></ul>

Comment by dalley@redhat.com, 2018-01-22T17:58:37Z (https://pulp.plan.io/issues/2835?journal_id=24141)
<ul><li><strong>Status</strong> changed from <i>ASSIGNED</i> to <i>POST</i></li></ul>

Comment by dalley@redhat.com, 2018-01-29T16:59:48Z (https://pulp.plan.io/issues/2835?journal_id=24286)
<ul><li><strong>Status</strong> changed from <i>POST</i> to <i>MODIFIED</i></li></ul><p>Applied in changeset <a class="changeset" title="Fixes workers crashing on restart edge case Fixes an issue where workers would crash when attemp..." href="https://pulp.plan.io/projects/pulp/repository/pulp/revisions/aa7a1de219b02bb0aa5a3674a72d77a653ee968f">pulp|aa7a1de219b02bb0aa5a3674a72d77a653ee968f</a>.</p>

Comment by bmbouter@redhat.com, 2018-02-12T19:58:26Z (https://pulp.plan.io/issues/2835?journal_id=24507)
<ul><li><strong>Platform Release</strong> set to <i>2.15.2</i></li></ul>

Comment by dalley@redhat.com, 2018-02-19T15:45:51Z (https://pulp.plan.io/issues/2835?journal_id=24625)
<p>Applied in changeset <a class="changeset" title="Fixes workers crashing on restart edge case Fixes an issue where workers would crash when attemp..." href="https://pulp.plan.io/projects/pulp/repository/pulp/revisions/76b756f3965bc5603f99c9427201a1a00b9fa585">pulp|76b756f3965bc5603f99c9427201a1a00b9fa585</a>.</p>

Comment by daviddavis, 2018-02-20T18:08:56Z (https://pulp.plan.io/issues/2835?journal_id=24655)
<ul><li><strong>Status</strong> changed from <i>MODIFIED</i> to <i>5</i></li></ul>

Comment by pthomas@redhat.com, 2018-02-26T15:59:17Z (https://pulp.plan.io/issues/2835?journal_id=24768)
<p>Tested this by following the steps from <a href="https://pulp.plan.io/issues/15">#15</a>.</p>

Comment by bmbouter@redhat.com, 2018-02-28T02:06:25Z (https://pulp.plan.io/issues/2835?journal_id=24821)
<ul><li><strong>Status</strong> changed from <i>5</i> to <i>CLOSED - CURRENTRELEASE</i></li></ul>

Comment by bmbouter@redhat.com, 2018-03-08T23:30:30Z (https://pulp.plan.io/issues/2835?journal_id=26344)
<ul><li><strong>Sprint</strong> set to <i>Sprint 31</i></li></ul>

Comment by bmbouter@redhat.com, 2018-03-08T23:31:01Z (https://pulp.plan.io/issues/2835?journal_id=26369)
<ul><li><strong>Sprint/Milestone</strong> deleted (<del><i>53</i></del>)</li></ul>

Comment by bmbouter@redhat.com, 2019-04-15T20:17:17Z (https://pulp.plan.io/issues/2835?journal_id=38221)
<ul><li><strong>Tags</strong> <i>Pulp 2</i> added</li></ul>