Issue #2734

closed

task group never quite finishes if pulp is restarted in the middle of task run

Added by cduryee over 7 years ago. Updated over 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
2.14.1
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
Yes
Tags:
Pulp 2
Sprint:
Sprint 24
Quarter:

Description

NOTE: this bug was found in a product that uses pulp 2.8. I have not yet reproduced it with plain pulp.

If you cancel a task in a task group, it appears that the task group is not updated.

For example:

  • create a task group with many applicability regeneration tasks
  • cancel tasks directly via `pulp-admin task cancel`

At this point, the task group should report that all the tasks are cancelled. However, it still shows the tasks in the 'waiting' state.


Related issues

Related to Pulp - Issue #2835: Tasks stuck in waiting after restart of pulp services (CLOSED - CURRENTRELEASE, dalley)
Related to Pulp - Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl (CLOSED - WONTFIX, daviddavis)
Actions #1

Updated by ttereshc over 7 years ago

A Pulp-only reproducer is needed to triage this bug.
Anyone, feel free to try to reproduce it.

cduryee, in case you have time for it, that would be great :)

Actions #2

Updated by ttereshc over 7 years ago

I was not able to reproduce it (with Pulp 2.12.2); see below for what I did.
There is a note in the docs that a task is not guaranteed to end up in the canceled state. I am not sure whether that is what is happening here; I would expect the task to end up in the error or finished state in that case. Any thoughts on this?

So far, task cancellation works for me:
- I created several consumer profiles, synced several repositories.
- Then triggered applicability regeneration with parallel=true to have a task group

$ curl -H "Accept: application/json" -X POST -d '{"repo_criteria": {"filters": {"id": {"$in": ["epel7", "epel7v2", "zooe1", "zoo", "upd_verizon"]}}}, "parallel": true}' -k -u admin:admin 'https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/'

{"group_id": "9f882286-8d8a-42a8-acb1-08127bcd2e8c", "_href": "/pulp/api/v2/task_groups/9f882286-8d8a-42a8-acb1-08127bcd2e8c/"}

- canceled several tasks via pulp-admin

$ pulp-admin tasks cancel --task-id f0799dde-0b5a-4e4c-8272-1b2b2d1761d4
Task cancel is successfully initiated.

$ pulp-admin tasks cancel --task-id d56d5b69-505a-4cd1-9c5d-639e9870b880
Task cancel is successfully initiated.

- requested state_summary and everything looks correct

$ curl -H "Accept: application/json" -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/9f882286-8d8a-42a8-acb1-08127bcd2e8c/state_summary/' | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   127  100   127    0     0    973      0 --:--:-- --:--:-- --:--:--   976
{
    "accepted": 0,
    "canceled": 2,
    "error": 0,
    "finished": 3,
    "running": 0,
    "skipped": 0,
    "suspended": 0,
    "total": 5,
    "waiting": 0
}
Actions #3

Updated by ttereshc over 7 years ago

Same results for upstream Pulp 2.8.7; I was not able to reproduce the issue.

Actions #4

Updated by ttereshc over 7 years ago

cduryee will update this issue with more info.

Actions #5

Updated by ttereshc over 7 years ago

  • Status changed from NEW to CLOSED - WORKSFORME

I was not able to reproduce it. Please add a comment with additional info and we can re-open the issue if needed.

Actions #6

Updated by cduryee over 7 years ago

  • Subject changed from cancelling task does not update task group to task group never quite finishes if pulp is restarted in the middle of task run
  • Status changed from CLOSED - WORKSFORME to NEW

Reopening; I was finally able to reproduce it. Sorry for the delay.

  • POST to https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/ for a big repo with lots of bound consumers.
  • You'll get a task group ID back; ensure that when you curl the task group's state summary, the counts are decreasing (a polling loop like the sketch below works well).
  • Kill celery (a regular kill, not -9) and then stop and start all pulp services.
  • Watch the state summary go down again; it will not quite reach zero, but no tasks will be running.
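
A small polling loop along these lines makes the behaviour easy to watch (an illustrative sketch only; the host, credentials and group id are placeholders to be replaced with the values from the POST response):

# Sketch: poll a task group's state_summary until no tasks are
# accepted, waiting or running, printing each summary along the way.
import time
import requests

GROUP_ID = '<group_id from the regenerate_applicability response>'
URL = 'https://localhost/pulp/api/v2/task_groups/%s/state_summary/' % GROUP_ID

while True:
    summary = requests.get(URL, auth=('admin', 'admin'), verify=False).json()
    print(summary)
    if summary['accepted'] + summary['waiting'] + summary['running'] == 0:
        break
    time.sleep(5)
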
Actions #8

Updated by ttereshc over 7 years ago

  • Priority changed from Normal to High
  • Triaged changed from No to Yes
  • Sprint Candidate changed from No to Yes
Actions #9

Updated by ttereshc over 7 years ago

  • Sprint/Milestone set to 42
Actions #10

Updated by ttereshc over 7 years ago

Anyone who is going to work on this issue, feel free to ping me if you need help with step 1 of the reproducer (applicability task).

Btw, could this issue be related to #2835?

Actions #11

Updated by cduryee over 7 years ago

It could be similar to https://pulp.plan.io/issues/2861. I would not be surprised if fixing that fixes this as well.

Actions #12

Updated by ttereshc over 7 years ago

  • Related to Issue #2835: Tasks stuck in waiting after restart of pulp services added
Actions #13

Updated by ttereshc over 7 years ago

  • Related to Issue #2861: Queued tasks will be lost if the resource manager is restarted via systemctl added
Actions #14

Updated by fdobrovo over 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to fdobrovo
Actions #15

Updated by fdobrovo over 7 years ago

I tried to reproduce this bug as suggested, but I never achieved the same result.

My steps:
1. Remove all repositories from pulp
2. Add centos repository

pulp-admin rpm repo create --repo-id=centos --feed=http://mirror.centos.org/centos/7/updates/x86_64/ --download-policy on_demand

3. Sync it

pulp-admin rpm repo sync run --repo-id centos

4. Run consumer generating script (https://pulp.plan.io/attachments/319/gen_test_data.py)

python gen_test_data.py

5. Regenerate consumer applicability and wait

curl -H "Accept: application/json" -X POST -d '{"consumer_criteria": {}}' -k -u admin:admin 'https://localhost/pulp/api/v2/consumers/actions/content/regenerate_applicability/'

6. Upload rpm to trigger repo change

pulp-admin rpm repo uploads --repo-id=centos --file=kangaroo-0.2-1.noarch.rpm

7. Generate repository applicability

curl -H "Accept: application/json" -X POST -d '{"repo_criteria": {"filters": {"id": {"$in": ["centos"]}}}, "parallel": true}' -k -u admin:admin 'https://localhost/pulp/api/v2/repositories/actions/content/regenerate_applicability/'

8. Get task group status

curl -H "Accept: application/json" -X GET  -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/<ID>/state_summary/'

9. Kill and stop all services

pkill celery; pstop

10. Start all services

pstart

11. Observe task group status

curl -H "Accept: application/json" -X GET  -k -u admin:admin 'https://localhost/pulp/api/v2/task_groups/<ID>/state_summary/'

The waiting count gets lower and lower and WILL reach 0, but 4 running tasks never finish.

Before restart:
{"accepted": 0, "finished": 4, "running": 4, "canceled": 0, "waiting": 42, "skipped": 0, "suspended": 0, "error": 0, "total": 50}
After restart:
{"accepted": 0, "finished": 15, "running": 5, "canceled": 0, "waiting": 30, "skipped": 0, "suspended": 0, "error": 0, "total": 50}
Stuck state:
{"accepted": 0, "finished": 46, "running": 4, "canceled": 0, "waiting": 0, "skipped": 0, "suspended": 0, "error": 0, "total": 50}

One of the running tasks details:

$ pulp-admin -vv tasks details --task-id 1be03945-479a-4712-b698-3133a0b3d0b9
+----------------------------------------------------------------------+
                              Task Details
+----------------------------------------------------------------------+

2017-08-02 14:20:43,205 - DEBUG - sending GET request to /pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/
2017-08-02 14:20:43,237 - INFO - GET request to /pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/ with parameters None
2017-08-02 14:20:43,237 - INFO - Response status : 200 

2017-08-02 14:20:43,237 - INFO - Response body :
 {
  "exception": null, 
  "task_type": "pulp.server.managers.consumer.applicability.batch_regenerate_applicability", 
  "_href": "/pulp/api/v2/tasks/1be03945-479a-4712-b698-3133a0b3d0b9/", 
  "task_id": "1be03945-479a-4712-b698-3133a0b3d0b9", 
  "tags": [], 
  "finish_time": null, 
  "_ns": "task_status", 
  "start_time": "2017-08-02T13:38:29Z", 
  "traceback": null, 
  "spawned_tasks": [], 
  "progress_report": {}, 
  "queue": "celery.dq", 
  "state": "running", 
  "worker_name": "celery", 
  "result": null, 
  "error": null, 
  "_id": {
    "$oid": "5981d5b58bb3cb1f7e15dc81"
  }, 
  "id": "5981d5b58bb3cb1f7e15dc81"
}

Operations:       
Resources:        
State:            Running
Start Time:       2017-08-02T13:38:29Z
Finish Time:      Incomplete
Result:           Incomplete
Task Id:          1be03945-479a-4712-b698-3133a0b3d0b9
Progress Report:  

I experienced the same behaviour on the 2.13 release as on pulp-2.8.7-1.

Actions #16

Updated by daviddavis over 7 years ago

  • Related to Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4 added
Actions #17

Updated by daviddavis over 7 years ago

  • Related to deleted (Issue #2954: Ensure that queued tasks are not lost by enabling task_reject_on_worker_lost for Celery 4)
Actions #18

Updated by mhrivnak over 7 years ago

  • Sprint/Milestone changed from 42 to 43
Actions #19

Updated by fdobrovo over 7 years ago

This bug has two parts:

1. A bug in Katello - I asked one of the Katello devs for help, and while doing so I found a slight misunderstanding of our API. Katello keeps its task marked as running as long as the number of tasks marked as finished is not equal to the total number of tasks in the group, but in our API the 'finished' field counts only tasks that completed successfully (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/webservices/views/task_groups.py#L66-L84 and https://github.com/pulp/pulp/blob/2.8-dev/common/pulp/common/constants.py#L26-L42). This is why the reproducer in the description is not reproducible with plain pulp. Even when part two is solved, this has to be fixed, because I believe that when a running task is killed it should be marked as 'failed', and Katello would then be waiting forever again. (See the completion sketch after part 2.)

2. A bug in pulp - Most of the tasks pulp handles are tasks with a reservation, called via apply_async_with_reservation (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L352). This routes the task through the _queue_reserved_task task (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L107-L160), where the crucial thing happens: a worker is selected and a new apply_async task is spawned with the worker name as a parameter (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L159). However, generate applicability is called with apply_async directly (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/managers/consumer/applicability.py#L168), so no worker is selected for such a task and the default routing key is saved to the database instead (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L440 and https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L430-L431). As a result, when the worker goes missing and then reappears, the task won't get cancelled, because it is not associated with a Worker (https://github.com/pulp/pulp/blob/2.8-dev/server/pulp/server/async/tasks.py#L255-L258). The sketches below illustrate both parts.
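
To make part 1 concrete, here is one way a client can decide that a group is done using the fields the state_summary endpoint already returns (a sketch only, using the field names from the summaries shown earlier in this issue); waiting for finished == total, as described above, never terminates once any task ends up canceled or in error:

def group_is_done(summary):
    # 'finished' alone under-counts: canceled, errored and skipped tasks
    # are terminal too, so finished == total is never reached if any task
    # was canceled or died with an error.
    terminal = (summary['finished'] + summary['error'] +
                summary['canceled'] + summary['skipped'])
    return terminal == summary['total']

# With the "stuck" summary from comment #15 this still returns False,
# because 4 tasks are permanently recorded as 'running' -- which is the
# pulp-side bug described in part 2.
stuck = {"accepted": 0, "finished": 46, "running": 4, "canceled": 0,
         "waiting": 0, "skipped": 0, "suspended": 0, "error": 0, "total": 50}
print(group_is_done(stuck))  # False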

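And a stripped-down illustration of the part 2 failure mode (hypothetical names, not Pulp's actual tasks.py code): cleanup after a worker restart cancels unfinished tasks by matching their recorded worker_name, so a task whose status was saved with the default routing key instead of a real worker name is never touched and stays 'running' forever:

# Hypothetical sketch of the cleanup behaviour described above; the class
# and function names are illustrative, not Pulp's real API.
class TaskStatus(object):
    def __init__(self, task_id, worker_name, state='running'):
        self.task_id = task_id
        self.worker_name = worker_name
        self.state = state

def cleanup_after_restart(restarted_worker_name, task_statuses):
    # Cancel incomplete tasks that were assigned to the restarted worker.
    for task in task_statuses:
        if task.state in ('waiting', 'running') and task.worker_name == restarted_worker_name:
            task.state = 'canceled'

tasks = [
    # Dispatched via apply_async_with_reservation: a real worker was recorded.
    TaskStatus('aaa', 'reserved_resource_worker-0@host'),
    # Dispatched via plain apply_async: only the default routing key was saved.
    TaskStatus('bbb', 'celery'),
]
cleanup_after_restart('reserved_resource_worker-0@host', tasks)
print([(t.task_id, t.state) for t in tasks])
# [('aaa', 'canceled'), ('bbb', 'running')]  <- 'bbb' is the task that never finishes
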
Actions #20

Updated by ttereshc over 7 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (fdobrovo)
Actions #21

Updated by dalley about 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley
Actions #22

Updated by dalley about 7 years ago

  • Status changed from ASSIGNED to POST

Added by dalley about 7 years ago

Revision 9bca6a9b | View on GitHub

Fix some tasks failing to cancel after pulp restart

Fixes an issue where tasks which go through apply_async() directly fail to be associated with a worker, and thus are not properly cancelled when the worker is restarted mid-execution and performs cleanup.

closes #2734 https://pulp.plan.io/issues/2734

Actions #23

Updated by dalley about 7 years ago

  • Status changed from POST to MODIFIED
Actions #24

Updated by dalley about 7 years ago

@Tanya, you should be aware that the fix for this (which is completely correct and proper) makes this vulnerable to a separate, and more serious problem in the tasking cleanup code. https://pulp.plan.io/issues/2835

That's entirely expected because prior to the fix it wasn't cleaning up those tasks properly, and now it is, so that code is being called...

There's a fix for 2835, but it definitely will not apply cleanly backwards. I'm trying to figure out a more straightforward and less involved way to fix it on pre-2.13 pulp for the sake of downstream.

Actions #25

Updated by dalley about 7 years ago

I verified that 2835 only affects versions of pulp >= 2.13 so my last comment can be disregarded with respect to downstream.

Added by dalley about 7 years ago

Revision fc9b0208 | View on GitHub

Fix some tasks failing to cancel after pulp restart

Fixes an issue where tasks which go through apply_async() directly fail to be associated with a worker, and thus are not properly cancelled when the worker is restarted mid-execution and performs cleanup.

closes #2734 https://pulp.plan.io/issues/2734

(cherry picked from commit 9bca6a9b2a5e560ff003258adef70d93268e5883)

Actions #27

Updated by pthomas@redhat.com about 7 years ago

Followed the steps from comment #15 to verify this in pulp 2.13.4.

Actions #28

Updated by pcreech about 7 years ago

  • Platform Release set to 2.14.1
Actions #29

Updated by pcreech about 7 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Actions #31

Updated by bmbouter over 6 years ago

  • Sprint set to Sprint 24
Actions #32

Updated by bmbouter over 6 years ago

  • Sprint/Milestone deleted (43)
Actions #33

Updated by bmbouter over 5 years ago

  • Tags Pulp 2 added
