Issue #6449

closed

Tasks stuck in Waiting state

Added by binlinf0 over 4 years ago. Updated about 4 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee:
Category: -
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity: 3. High
Version:
Platform Release:
OS: RHEL 7
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags:
Sprint: Sprint 81
Quarter:

Description

We have been seeing many tasks in the waiting state that appear to be stuck forever. For example:

./get /pulp/api/v3/tasks/14b76b27-9f34-4297-88ed-5ec13cbe5e50/

HTTP/1.1 200 OK
Allow: GET, PATCH, DELETE, HEAD, OPTIONS
Connection: keep-alive
Content-Length: 323
Content-Type: application/json
Date: Fri, 03 Apr 2020 12:56:02 GMT
Server: nginx/1.16.1
Vary: Accept, Cookie
X-Frame-Options: SAMEORIGIN

{
    "created_resources": [],
    "error": null,
    "finished_at": null,
    "name": "pulpcore.app.tasks.base.general_update",
    "progress_reports": [],
    "pulp_created": "2020-04-02T13:00:14.881212Z",
    "pulp_href": "/pulp/api/v3/tasks/14b76b27-9f34-4297-88ed-5ec13cbe5e50/",
    "reserved_resources_record": [],
    "started_at": null,
    "state": "waiting",
    "worker": null
}

Per Brian: "So the problematic thing I see in this output is the 'resource-manager | 0'. This tells me that Pulp's record of the task is in PostgreSQL (and was never run), but RQ has lost the task from the 'resource-manager' queue in Redis. So the next question is how did that happen?"
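A minimal sketch of how a task record like the one above could be flagged as suspicious, assuming only the JSON fields shown in the response. The function name `looks_stuck` and the one-hour `max_wait` threshold are hypothetical and chosen for illustration; this does not query Pulp, PostgreSQL, or Redis.

```python
from datetime import datetime, timedelta, timezone


def looks_stuck(task, now, max_wait=timedelta(hours=1)):
    """Return True if a task record resembles the stuck task shown above.

    `task` is a dict shaped like the /pulp/api/v3/tasks/<uuid>/ response.
    `max_wait` is an arbitrary illustrative threshold, not a Pulp setting.
    """
    if task["state"] != "waiting":
        return False
    # A task that was never picked up has no start time and no worker.
    if task["started_at"] is not None or task["worker"] is not None:
        return False
    created = datetime.strptime(
        task["pulp_created"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return now - created > max_wait


# Fields copied from the response in the description.
example = {
    "pulp_created": "2020-04-02T13:00:14.881212Z",
    "started_at": None,
    "state": "waiting",
    "worker": None,
}
# The Date header of the response, used as the time of the check.
checked_at = datetime(2020, 4, 3, 12, 56, 2, tzinfo=timezone.utc)
print(looks_stuck(example, checked_at))
```

A check like this only says that the database record has been waiting for a long time; confirming that the job is actually missing from the queue in Redis would still require inspecting Redis itself.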


Related issues

Related to Pulp - Issue #7119: Tasks stay in waiting state if worker that had resource reservation gone (CLOSED - NOTABUG)
Related to Pulp - Issue #7387: Tasks not delivered to resource-manager are not cleaned up (CLOSED - DUPLICATE)
Has duplicate Pulp - Issue #7387: Tasks not delivered to resource-manager are not cleaned up (CLOSED - DUPLICATE)
Actions #1

Updated by daviddavis over 4 years ago

This seems to produce stuck waiting tasks for me:

export I=$RANDOM

# Create a file remote and capture its href
http POST $BASE_ADDR/pulp/api/v3/remotes/file/file/ name=foo$I url=bar
export REMOTE_HREF=$(http $BASE_ADDR/pulp/api/v3/remotes/file/file/ | jq -r '.results[0] | .pulp_href')

# Dispatch an update task while redis is down, then bring redis back
sudo systemctl stop redis
http PATCH $BASE_ADDR$REMOTE_HREF name=test$I
sudo systemctl start redis

# this task will be stuck forever and all future tasks get stuck too
http PATCH $BASE_ADDR$REMOTE_HREF name=test
Actions #2

Updated by fao89 over 4 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 70
Actions #3

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 70 to Sprint 71
Actions #4

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 71 to Sprint 72
Actions #5

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 72 to Sprint 73
Actions #6

Updated by dalley over 4 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley
Actions #7

Updated by pulpbot over 4 years ago

  • Status changed from ASSIGNED to POST
Actions #8

Updated by dalley over 4 years ago

Pasting this one comment from github for historical record:

I might know what it is. Each task is spawned onto a queue named after the worker it is assigned to. This worker is determined via _acquire_worker with a database query. When Redis goes down, all the workers get killed off (and recreated), but their records remain in the database for 30 seconds until they get cleaned up in the background.

If a new task is dispatched quickly enough, it could be assigned to a dead worker and dumped into the queue for that dead worker. That should be easy enough to verify...

This would have been introduced when we started using a random(ish?) number component in the worker names.
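The hypothesis above can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: `ToyCluster` and all of its methods are hypothetical, the dicts stand in for the database and the Redis queues, and the worker-selection rule (take the first record) is picked deliberately to show the failure; it is not how `_acquire_worker` is implemented.

```python
CLEANUP_DELAY = 30  # seconds a dead worker's record lingers, per the comment above


class ToyCluster:
    """Hypothetical stand-in for worker records (DB) and per-worker queues (Redis)."""

    def __init__(self):
        self.records = {}  # worker name -> time of last heartbeat
        self.alive = set()  # names of worker processes that are really running
        self.queues = {}  # worker name -> list of queued tasks

    def start_worker(self, name, now):
        self.records[name] = now
        self.alive.add(name)
        self.queues[name] = []

    def kill_all_workers(self):
        # Processes die, but their records are NOT removed yet.
        self.alive.clear()

    def heartbeat_and_cleanup(self, now):
        for name in self.alive:
            self.records[name] = now
        stale = [n for n, seen in self.records.items() if now - seen > CLEANUP_DELAY]
        for name in stale:
            del self.records[name]

    def dispatch(self, task):
        # Assumption: choose a worker from the records only, without
        # checking whether the process behind the record is still alive.
        worker = next(iter(self.records))
        self.queues[worker].append(task)
        return worker

    def run_pending(self):
        done = []
        for name in self.alive:
            done.extend(self.queues[name])
            self.queues[name] = []
        return done


cluster = ToyCluster()
cluster.start_worker("worker-1111", now=0)
cluster.kill_all_workers()  # Redis outage
cluster.start_worker("worker-2222", now=11)  # recreated under a new name

assigned_a = cluster.dispatch("task-A")  # lands on the dead worker's queue
ran_first = cluster.run_pending()  # nothing runs

cluster.heartbeat_and_cleanup(now=31)  # stale record is finally removed
assigned_b = cluster.dispatch("task-B")
ran_second = cluster.run_pending()

print(assigned_a, ran_first, assigned_b, ran_second, cluster.queues["worker-1111"])
```

In this model the task dispatched during the window stays in the orphaned queue even after the stale record is cleaned up, because the recreated worker has a different name and never reads that queue.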

Actions #9

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 73 to Sprint 74
Actions #10

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 74 to Sprint 75
Actions #11

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 75 to Sprint 76
Actions #14

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 76 to Sprint 77
Actions #15

Updated by dalley over 4 years ago

  • Related to Issue #7119: Tasks stay in waiting state if worker that had resource reservation gone added
Actions #16

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 77 to Sprint 78
Actions #17

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 78 to Sprint 79
Actions #18

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 79 to Sprint 80
Actions #19

Updated by fao89 over 4 years ago

  • Related to Issue #7387: Tasks not delivered to resource-manager are not cleaned up added
Actions #20

Updated by daviddavis over 4 years ago

  • Has duplicate Issue #7387: Tasks not delivered to resource-manager are not cleaned up added
Actions #21

Updated by daviddavis over 4 years ago

  • Priority changed from Normal to High
Actions #22

Updated by daviddavis over 4 years ago

Setting the priority to high since Ansible also reported this.

Actions #23

Updated by rchan over 4 years ago

  • Sprint changed from Sprint 80 to Sprint 81

Added by dalley over 4 years ago

Revision fd986b4e

Fix tasks stuck in waiting state on redis connection failure

closes: #6449 https://pulp.plan.io/issues/6449

Actions #25

Updated by dalley over 4 years ago

  • Status changed from POST to MODIFIED
Actions #26

Updated by bmbouter about 4 years ago

  • Sprint/Milestone set to 3.7.0
Actions #27

Updated by pulpbot about 4 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
