Issue #2734: task group never quite finishes if pulp is restarted in the middle of task run - Pulp

Actions

Send by e-mail Copy link

Issue #2734

closed

task group never quite finishes if pulp is restarted in the middle of task run

Added by cduryee over 7 years ago. Updated almost 6 years ago.

Status:

CLOSED - CURRENTRELEASE

Priority:

High

Assignee:

dalley

Category:

Sprint/Milestone:

Start date:

Due date:

Estimated time:

Severity:

2. Medium

Version:

Platform Release:

2.14.1

OS:

Triaged:

Yes

Groomed:

Sprint Candidate:

Yes

Tags:

Pulp 2

Sprint:

Sprint 24

Quarter:

Description

NOTE: this bug was found in a product that uses pulp 2.8. I have not yet reproduced w/ plain pulp.

If you cancel a task in a task group, it appears that the task group is not updated.

For example:

create a task group with many applicability regeneration tasks
cancel tasks directly via `pulp-admin task cancel`

At this point, the task group should say that all the tasks are cancelled. However, it will show the tasks as in 'waiting' state still.

Related issues

Actions

Copy link

Updated by ttereshc over 7 years ago

Pulp-only reproducer is needed to triage this bug.
Anyone, feel free to try to reproduce it.

cduryee, just in case you have time for it, that will be great :)

Actions

Copy link

Updated by ttereshc over 7 years ago

I was not able to reproduce it (with Pulp 2.12.2), see below what I did.
There is a note in the docs that it is not guaranteed that task will be in the canceled state. I am not sure if this is the case, I would expect task end up in error or finished state then. Any thoughts on this?

So far, task cancellation works for me:
- I created several consumer profiles, synced several repositories.
- Then triggered applicability regeneration with parallel=true to have a task group

$ curl -H "Accept: application/json" -X POST -d '{"repo_criteria": {"filters": {"id": {"$in": ["epel7", "epel7v2", "zooe1", "zoo", "upd_verizon"]}}}, "parallel": true}' -k -u admin:admin 'https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/'

{"group_id": "9f882286-8d8a-42a8-acb1-08127bcd2e8c", "_href": "/pulp/api/v2/task_groups/9f882286-8d8a-42a8-acb1-08127bcd2e8c/"}

- canceled several tasks via pulp-admin

$ pulp-admin tasks cancel --task-id f0799dde-0b5a-4e4c-8272-1b2b2d1761d4
Task cancel is successfully initiated.

$ pulp-admin tasks cancel --task-id d56d5b69-505a-4cd1-9c5d-639e9870b880
Task cancel is successfully initiated.

- requested state_summary and everything looks correct

$ curl -H "Accept: application/json" -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/9f882286-8d8a-42a8-acb1-08127bcd2e8c/state_summary/' | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   127  100   127    0     0    973      0 --:--:-- --:--:-- --:--:--   976
{
    "accepted": 0,
    "canceled": 2,
    "error": 0,
    "finished": 3,
    "running": 0,
    "skipped": 0,
    "suspended": 0,
    "total": 5,
    "waiting": 0
}

Actions

Copy link

Updated by ttereshc over 7 years ago

Same results for Pulp 2.8.7 upstream, I was not able to reproduce the issue.

Actions

Copy link

Updated by ttereshc over 7 years ago

cduryee will update this issue with more info.

Actions

Copy link

Updated by ttereshc over 7 years ago

Status changed from NEW to CLOSED - WORKSFORME

I was not able to reproduce it, please put comment with additional info and we can re-open the issue if needed.

Actions

Copy link

Updated by cduryee over 7 years ago

Subject changed from cancelling task does not update task group to task group never quite finishes if pulp is restarted in the middle of task run
Status changed from CLOSED - WORKSFORME to NEW

reopening, I finally was able to repro. Sorry for the delay.

POST to https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/ for a big repo with lots of bound consumers.
you'll get a task group ID back, ensure that if you curl the task group ID's state summary, you get a number that's decreasing
kill celery (just regular, not -9) and stop and start all pulp services
watch state summary go down again, it will not quite get to zero, but no tasks will be running

Actions

Copy link

Updated by ttereshc over 7 years ago

Priority changed from Normal to High
Triaged changed from No to Yes
Sprint Candidate changed from No to Yes

Actions

Copy link

Updated by ttereshc over 7 years ago

Sprint/Milestone set to 42

Actions

Copy link

#10

Updated by ttereshc over 7 years ago

Anyone who is going to work on this issue, feel free to ping me if you need help with step 1 of the reproducer (applicability task).

Btw, could this issue be related to #2835?

Actions

Copy link

#11

Updated by cduryee over 7 years ago

It could be similar to https://pulp.plan.io/issues/2861. I would not be surprised if fixing that fixes this as well.

Actions

Copy link

#12

Updated by ttereshc over 7 years ago

Related to Issue #2835: Tasks stuck in waiting after restart of pulp services added

Actions

Copy link

#13

Updated by ttereshc over 7 years ago

Related to Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl added

Actions

Copy link

#14

Updated by fdobrovo over 7 years ago

Status changed from NEW to ASSIGNED
Assignee set to fdobrovo

Actions

Copy link

#15

Updated by fdobrovo over 7 years ago

I tried to reproduce this bug as suggested, but I never achieved the same result.

My steps:
1. Remove all repositories from pulp
2. Add centos repository

pulp-admin rpm repo create --repo-id=centos --feed=http://mirror.centos.org/centos/7/updates/x86_64/ --download-policy on_demand

3. Sync it

pulp-admin rpm repo sync run --repo-id centos

4. Run consumer generating script (https://pulp.plan.io/attachments/319/gen_test_data.py)

python gen_test_data.py

5. Regenerate consumer applicability and wait

curl -H "Accept: application/json" -X POST -d '{"consumer_criteria": {}}' -k -u admin:admin 'https://localhost/pulp/api/v2/consumers/actions/content/regenerate_applicability/'

6. Upload rpm to trigger repo change

pulp-admin rpm repo uploads --repo-id=centos --file=kangaroo-0.2-1.noarch.rpm

7. Generate repository aplicalibity

curl -H "Accept: application/json" -X POST -d '{"repo_criteria": {"filters": {"id": {"$in": ["centos"]}}}, "parallel": true}' -k -u admin:admin 'https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/'

8. Get task group status

curl -H "Accept: application/json" -X GET  -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/<ID>/state_summary/'

9. Kill and stop all services

pkill celery; pstop

10. start all services

pstart

11. Observe task group status

curl -H "Accept: application/json" -X GET  -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/<ID>/state_summary/'

The waiting will be getting lower and lower and WILL reach 0. But 4 running tasks never finish.

Before restart:
{"accepted": 0, "finished": 4, "running": 4, "canceled": 0, "waiting": 42, "skipped": 0, "suspended": 0, "error": 0, "total": 50}
After restart:
{"accepted": 0, "finished": 15, "running": 5, "canceled": 0, "waiting": 30, "skipped": 0, "suspended": 0, "error": 0, "total": 50}
Stucked state:
{"accepted": 0, "finished": 46, "running": 4, "canceled": 0, "waiting": 0, "skipped": 0, "suspended": 0, "error": 0, "total": 50}

One of the running tasks details:

$ pulp-admin -vv tasks details --task-id 1be03945-479a-4712-b698-3133a0b3d0b9
+----------------------------------------------------------------------+
                              Task Details
+----------------------------------------------------------------------+

2017-08-02 14:20:43,205 - DEBUG - sending GET request to /pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/
2017-08-02 14:20:43,237 - INFO - GET request to /pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/ with parameters None
2017-08-02 14:20:43,237 - INFO - Response status : 200 

2017-08-02 14:20:43,237 - INFO - Response body :
 {
  "exception": null, 
  "task_type": "pulp.server.managers.consumer.applicability.batch_regenerate_applicability", 
  "_href": "/pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/", 
  "task_id": "1be03945-479a-4712-b698-3133a0b3d0b9", 
  "tags": [], 
  "finish_time": null, 
  "_ns": "task_status", 
  "start_time": "2017-08-02T13:38:29Z", 
  "traceback": null, 
  "spawned_tasks": [], 
  "progress_report": {}, 
  "queue": "celery.dq", 
  "state": "running", 
  "worker_name": "celery", 
  "result": null, 
  "error": null, 
  "_id": {
    "$oid": "5981d5b58bb3cb1f7e15dc81"
  }, 
  "id": "5981d5b58bb3cb1f7e15dc81"
}

Operations:       
Resources:        
State:            Running
Start Time:       2017-08-02T13:38:29Z
Finish Time:      Incomplete
Result:           Incomplete
Task Id:          1be03945-479a-4712-b698-3133a0b3d0b9
Progress Report:

I experienced the same behaviour on 2.13-release as on pulp-2.8.7-1

Actions

Copy link

#16

Updated by daviddavis over 7 years ago

Related to Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4 added

Actions

Copy link

#17

Updated by daviddavis over 7 years ago

Related to deleted (Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4)

Actions

Copy link

#18

Updated by mhrivnak over 7 years ago

Sprint/Milestone changed from 42 to 43

Actions

Copy link

#19

Updated by fdobrovo over 7 years ago

This bug has two parts:

1. A bug in Katello - I asked one of the devs from Katello for help, during that I found out that there is a slight misunderstanding of our API. Katello marks the task as running till the number of tasks marked as finished is not equal to the total number of tasks in a group. But in our API field 'finished' marks only tasks that have finished successfully. (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/webservices/views/task_groups.py#L66-L84 and https://github.com/pulp/pulp/blob/2.8-dev/common/pulp/common/constants.py#L26-L42). This is why the reproducer in the description is not reproducible with plain pulp. Even when part two is solved this has to be fixed. Because I believe that when running task is killed it should be marked as 'failed' and therefore Katello would be waiting forever again.

2. A bug in pulp - Most of the tasks that pulp handles are tasks with reservation called via apply_async_with_reservation (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L352) this causes the task to go through _queue_reserved_task Task (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L107-L160) in this function crucial thing happens a worker is selected and a new apply_async task is spawned with worker name as parameter (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L159) However generate applicability is called with apply_async (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/managers/consumer/applicability.py#L168) and therefore no worker is selected for such task and default routing key is saved to database instead. (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L440 and https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L430-L431). This leads to the fact that when the worker goes missing and then reappears the task won't get cancelled because it's not associated with Worker (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L255-L258)

Actions

Copy link

#20

Updated by ttereshc over 7 years ago

Status changed from ASSIGNED to NEW
Assignee deleted (~~fdobrovo~~)

Actions

Copy link

#21

Updated by dalley over 7 years ago

Status changed from NEW to ASSIGNED
Assignee set to dalley

Actions

Copy link

#22

Updated by dalley over 7 years ago

Status changed from ASSIGNED to POST

PR here:
https://github.com/pulp/pulp/pull/3127

Added by dalley over 7 years ago

Revision 9bca6a9b | View on GitHub

Fix some tasks failing to cancel after pulp restart

Fixes an issue where tasks which go through apply_async() directly fail to be associated with a worker, and thus are not properly cancelled when the worker is restarted mid-execution and performs cleanup.

closes #2734 https://pulp.plan.io/issues/2734

Actions

Copy link

#23

Updated by dalley over 7 years ago

Status changed from POST to MODIFIED

Applied in changeset pulp|9bca6a9b2a5e560ff003258adef70d93268e5883.

Actions

Copy link

#24

Updated by dalley over 7 years ago

@Tanya, you should be aware that the fix for this (which is completely correct and proper) makes this vulnerable to a separate, and more serious problem in the tasking cleanup code. https://pulp.plan.io/issues/2835

That's entirely expected because prior to the fix it wasn't cleaning up those tasks properly, and now it is, so that code is being called...

There's a fix for 2835, but it definitely will not apply cleanly backwards. I'm trying to figure out a more straightforward and less involved way to fix it on pre-2.13 pulp for the sake of downstream.

Actions

Copy link

#25