Issue #2734
Closed - task group never quite finishes if pulp is restarted in the middle of task run
Description
NOTE: this bug was found in a product that uses pulp 2.8. I have not yet reproduced it with plain pulp.
If you cancel a task in a task group, it appears that the task group is not updated.
For example:
- create a task group with many applicability regeneration tasks
- cancel tasks directly via `pulp-admin task cancel`
At this point, the task group should say that all the tasks are cancelled. However, it still shows the tasks in the 'waiting' state.
Related issues
Updated by ttereshc over 7 years ago
A Pulp-only reproducer is needed to triage this bug.
Anyone, feel free to try to reproduce it.
cduryee, just in case you have time for it, that would be great :)
Updated by ttereshc over 7 years ago
I was not able to reproduce it (with Pulp 2.12.2), see below what I did.
There is a note in the docs that it is not guaranteed that a task will end up in the canceled state. I am not sure if that is what is happening here; I would expect the task to end up in the error or finished state in that case. Any thoughts on this?
So far, task cancellation works for me:
- I created several consumer profiles, synced several repositories.
- Then triggered applicability regeneration with parallel=true to have a task group
$ curl -H "Accept: application/json" -X POST -d '{"repo_criteria": {"filters": {"id": {"$in": ["epel7", "epel7v2", "zooe1", "zoo", "upd_verizon"]}}}, "parallel": true}' -k -u admin:admin 'https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/'
{"group_id": "9f882286-8d8a-42a8-acb1-08127bcd2e8c", "_href": "/pulp/api/v2/task_groups/9f882286-8d8a-42a8-acb1-08127bcd2e8c/"}
- canceled several tasks via pulp-admin
$ pulp-admin tasks cancel --task-id f0799dde-0b5a-4e4c-8272-1b2b2d1761d4
Task cancel is successfully initiated.
$ pulp-admin tasks cancel --task-id d56d5b69-505a-4cd1-9c5d-639e9870b880
Task cancel is successfully initiated.
- requested state_summary and everything looks correct
$ curl -H "Accept: application/json" -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/9f882286-8d8a-42a8-acb1-08127bcd2e8c/state_summary/' | python -m json.tool
{
"accepted": 0,
"canceled": 2,
"error": 0,
"finished": 3,
"running": 0,
"skipped": 0,
"suspended": 0,
"total": 5,
"waiting": 0
}
Updated by ttereshc over 7 years ago
Same results for Pulp 2.8.7 upstream, I was not able to reproduce the issue.
Updated by ttereshc over 7 years ago
cduryee will update this issue with more info.
Updated by ttereshc over 7 years ago
- Status changed from NEW to CLOSED - WORKSFORME
I was not able to reproduce it. Please add a comment with additional info and we can re-open the issue if needed.
Updated by cduryee over 7 years ago
- Subject changed from cancelling task does not update task group to task group never quite finishes if pulp is restarted in the middle of task run
- Status changed from CLOSED - WORKSFORME to NEW
Reopening, I was finally able to reproduce it. Sorry for the delay.
- POST to https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/ for a big repo with lots of bound consumers.
- you'll get a task group ID back; curl the task group's state summary and confirm that the counts of unfinished tasks are going down
- kill celery (a regular kill, not -9) and stop and start all pulp services
- watch the state summary go down again; it will not quite reach zero, but no tasks will be running (see the polling sketch below)
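For the last step, a hypothetical polling snippet (not part of Pulp, written here only for convenience) that watches the state summary after the restart; the <ID> placeholder, the admin credentials, and the five-second interval are assumptions taken from the commands above.
import time
import requests

# Placeholder URL: substitute the task group ID returned by the POST above.
GROUP_URL = "https://localhost/pulp/api/v2/task_groups/<ID>/state_summary/"

while True:
    summary = requests.get(GROUP_URL, auth=("admin", "admin"), verify=False).json()
    print(summary)
    # On an affected system 'waiting' keeps dropping, but 'running' never
    # reaches zero even though no worker is actually executing anything.
    if summary["waiting"] == 0:
        break
    time.sleep(5)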
Updated by ttereshc over 7 years ago
- Priority changed from Normal to High
- Triaged changed from No to Yes
- Sprint Candidate changed from No to Yes
Updated by ttereshc over 7 years ago
Anyone who is going to work on this issue, feel free to ping me if you need help with step 1 of the reproducer (applicability task).
Btw, could this issue be related to #2835?
Updated by cduryee over 7 years ago
It could be similar to https://pulp.plan.io/issues/2861. I would not be surprised if fixing that fixes this as well.
Updated by ttereshc over 7 years ago
- Related to Issue #2835: Tasks stuck in waiting after restart of pulp services added
Updated by ttereshc over 7 years ago
- Related to Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl added
Updated by fdobrovo over 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to fdobrovo
Updated by fdobrovo over 7 years ago
I tried to reproduce this bug as suggested, but I never achieved the same result.
My steps:
1. Remove all repositories from pulp
2. Add centos repository
pulp-admin rpm repo create --repo-id=centos --feed=http://mirror.centos.org/centos/7/updates/x86_64/ --download-policy on_demand
3. Sync it
pulp-admin rpm repo sync run --repo-id centos
4. Run consumer generating script (https://pulp.plan.io/attachments/319/gen_test_data.py)
python gen_test_data.py
5. Regenerate consumer applicability and wait
curl -H "Accept: application/json" -X POST -d '{"consumer_criteria": {}}' -k -u admin:admin 'https://localhost/pulp/api/v2/consumers/actions/content/regenerate_applicability/'
6. Upload rpm to trigger repo change
pulp-admin rpm repo uploads --repo-id=centos --file=kangaroo-0.2-1.noarch.rpm
7. Regenerate repository applicability
curl -H "Accept: application/json" -X POST -d '{"repo_criteria": {"filters": {"id": {"$in": ["centos"]}}}, "parallel": true}' -k -u admin:admin 'https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/'
8. Get task group status
curl -H "Accept: application/json" -X GET -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/<ID>/state_summary/'
9. Kill and stop all services
pkill celery; pstop
10. start all services
pstart
11. Observe task group status
curl -H "Accept: application/json" -X GET -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/<ID>/state_summary/'
The waiting count keeps getting lower and lower and DOES reach 0, but 4 running tasks never finish.
Before restart:
{"accepted": 0, "finished": 4, "running": 4, "canceled": 0, "waiting": 42, "skipped": 0, "suspended": 0, "error": 0, "total": 50}
After restart:
{"accepted": 0, "finished": 15, "running": 5, "canceled": 0, "waiting": 30, "skipped": 0, "suspended": 0, "error": 0, "total": 50}
Stuck state:
{"accepted": 0, "finished": 46, "running": 4, "canceled": 0, "waiting": 0, "skipped": 0, "suspended": 0, "error": 0, "total": 50}
One of the running tasks details:
$ pulp-admin -vv tasks details --task-id 1be03945-479a-4712-b698-3133a0b3d0b9
+----------------------------------------------------------------------+
Task Details
+----------------------------------------------------------------------+
2017-08-02 14:20:43,205 - DEBUG - sending GET request to /pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/
2017-08-02 14:20:43,237 - INFO - GET request to /pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/ with parameters None
2017-08-02 14:20:43,237 - INFO - Response status : 200
2017-08-02 14:20:43,237 - INFO - Response body :
{
"exception": null,
"task_type": "pulp.server.managers.consumer.applicability.batch_regenerate_applicability",
"_href": "/pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/",
"task_id": "1be03945-479a-4712-b698-3133a0b3d0b9",
"tags": [],
"finish_time": null,
"_ns": "task_status",
"start_time": "2017-08-02T13:38:29Z",
"traceback": null,
"spawned_tasks": [],
"progress_report": {},
"queue": "celery.dq",
"state": "running",
"worker_name": "celery",
"result": null,
"error": null,
"_id": {
"$oid": "5981d5b58bb3cb1f7e15dc81"
},
"id": "5981d5b58bb3cb1f7e15dc81"
}
Operations:
Resources:
State: Running
Start Time: 2017-08-02T13:38:29Z
Finish Time: Incomplete
Result: Incomplete
Task Id: 1be03945-479a-4712-b698-3133a0b3d0b9
Progress Report:
I experienced the same behaviour on 2.13-release as on pulp-2.8.7-1.
Updated by daviddavis over 7 years ago
- Related to Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4 added
Updated by daviddavis over 7 years ago
- Related to deleted (Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4)
Updated by fdobrovo over 7 years ago
This bug has two parts:
1. A bug in Katello - I asked one of the Katello devs for help, and during that I found a slight misunderstanding of our API. Katello keeps the task group marked as running until the number of tasks reported as finished equals the total number of tasks in the group, but in our API the 'finished' field counts only tasks that completed successfully (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/webservices/views/task_groups.py#L66-L84 and https://github.com/pulp/pulp/blob/2.8-dev/common/pulp/common/constants.py#L26-L42). This is why the reproducer in the description is not reproducible with plain pulp. Even when part two is solved this has to be fixed, because I believe that when a running task is killed it should be marked as 'failed', and Katello would then be waiting forever again. (See the first sketch after this list.)
2. A bug in pulp - Most of the tasks that pulp handles are tasks with a reservation, called via apply_async_with_reservation (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L352). This causes the task to go through the _queue_reserved_task task (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L107-L160), where the crucial thing happens: a worker is selected and a new apply_async task is spawned with the worker name as a parameter (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L159). However, applicability generation is called with apply_async directly (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/managers/consumer/applicability.py#L168), so no worker is selected for such a task and the default routing key is saved to the database instead (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L440 and https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L430-L431). As a result, when the worker goes missing and then reappears, the task won't get canceled because it is not associated with a Worker (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L255-L258). (See the second sketch after this list.)
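To make part 1 concrete, here is a minimal sketch (not Pulp or Katello code) of how a client could decide that a task group is complete, using the state_summary fields shown in the curl output earlier in this issue; the helper name and the assumption that 'skipped' counts as a terminal state are mine.
# Hypothetical helper illustrating the 'finished' vs 'complete' distinction.
# Field names come from the state_summary responses shown above; treating
# 'skipped' as terminal is an assumption of this sketch.
TERMINAL_STATES = ("finished", "error", "canceled", "skipped")

def group_is_complete(summary):
    # Checking summary["finished"] == summary["total"] is wrong: 'finished'
    # only counts tasks that completed successfully, so a group containing
    # canceled or failed tasks would never look complete.
    return sum(summary[state] for state in TERMINAL_STATES) == summary["total"]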
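And for part 2, a deliberately simplified sketch of why such tasks survive a restart. This is not Pulp's actual cleanup code; the function and the dict-shaped task records are hypothetical, and only the default worker_name "celery" is taken from the task details shown above.
def cancel_tasks_of_missing_worker(missing_worker_name, task_statuses):
    # Hypothetical illustration: restart cleanup cancels only the tasks whose
    # recorded worker_name matches the worker that went away.
    for task in task_statuses:
        if task["worker_name"] == missing_worker_name:
            task["state"] = "canceled"
    # Tasks dispatched with plain apply_async() were stored with the default
    # routing key (worker_name == "celery", as in the task details above),
    # which never matches a real worker, so they stay "running" forever.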
Updated by ttereshc over 7 years ago
- Status changed from ASSIGNED to NEW
- Assignee deleted (fdobrovo)
Updated by dalley over 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dalley
Updated by dalley over 7 years ago
- Status changed from ASSIGNED to POST
Updated by dalley over 7 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|9bca6a9b2a5e560ff003258adef70d93268e5883.
Updated by dalley over 7 years ago
@Tanya, you should be aware that the fix for this (which is completely correct and proper) makes pulp vulnerable to a separate and more serious problem in the tasking cleanup code: https://pulp.plan.io/issues/2835
That's entirely expected because prior to the fix it wasn't cleaning up those tasks properly, and now it is, so that code is being called...
There's a fix for 2835, but it definitely will not apply cleanly backwards. I'm trying to figure out a more straightforward and less involved way to fix it on pre-2.13 pulp for the sake of downstream.
Updated by dalley over 7 years ago
I verified that 2835 only affects versions of pulp >= 2.13, so my last comment can be disregarded with respect to downstream.
Added by dalley over 7 years ago
Revision fc9b0208 | View on GitHub
Fix some tasks failing to cancel after pulp restart
Fixes an issue where tasks which go through apply_async() directly fail to be associated with a worker, and thus are not properly cancelled when the worker is restarted mid-execution and performs cleanup.
closes #2734 https://pulp.plan.io/issues/2734
(cherry picked from commit 9bca6a9b2a5e560ff003258adef70d93268e5883)
Updated by dalley over 7 years ago
Applied in changeset pulp|fc9b02084241f9d3a1db1025f2c71cf1af0bcd8e.
Updated by pthomas@redhat.com about 7 years ago
Followed the steps from comment#15 to verify this in pulp 2.13.4
Updated by pcreech about 7 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE