Issue #6449

Tasks stuck in Waiting state

Added by binlinf0 7 months ago. Updated 28 days ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee: dalley
Category: -
Sprint/Milestone: 3.7.0
Start date: -
Due date: -
Estimated time: -
Severity: 3. High
Version: -
Platform Release: -
OS: RHEL 7
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: -
Sprint: Sprint 81
Quarter: -

Description

We have been seeing many waiting tasks. They appear to be stuck forever. For example:

./get /pulp/api/v3/tasks/14b76b27-9f34-4297-88ed-5ec13cbe5e50/

HTTP/1.1 200 OK
Allow: GET, PATCH, DELETE, HEAD, OPTIONS
Connection: keep-alive
Content-Length: 323
Content-Type: application/json
Date: Fri, 03 Apr 2020 12:56:02 GMT
Server: nginx/1.16.1
Vary: Accept, Cookie
X-Frame-Options: SAMEORIGIN

{ "created_resources": [], "error": null, "finished_at": null, "name": "pulpcore.app.tasks.base.general_update", "progress_reports": [], "pulp_created": "2020-04-02T13:00:14.881212Z", "pulp_href": "/pulp/api/v3/tasks/14b76b27-9f34-4297-88ed-5ec13cbe5e50/", "reserved_resources_record": [], "started_at": null, "state": "waiting", "worker": null }

Per Brian "So the problematic thing I see in this output is the "resource-manager | 0". This tells me that Pulp's record of the task is in postgresql (and was never run), but RQ has lost the task from the "resource-manager" queue in Redis. So the next question is how did that happen?"


Related issues

Related to Pulp - Issue #7119: Tasks stay in waiting state if worker that had resource reservation gone (CLOSED - NOTABUG)
Related to Pulp - Issue #7387: Tasks not delivered to resource-manager are not cleaned up (CLOSED - DUPLICATE)
Has duplicate Pulp - Issue #7387: Tasks not delivered to resource-manager are not cleaned up (CLOSED - DUPLICATE)

Associated revisions

Revision fd986b4e
Added by dalley about 1 month ago

Fix tasks stuck in waiting state on redis connection failure

closes: #6449 https://pulp.plan.io/issues/6449

History

#1 Updated by daviddavis 7 months ago

This seems to produce stuck waiting tasks for me:

export I=$RANDOM

# create a file remote and grab its href
http POST $BASE_ADDR/pulp/api/v3/remotes/file/file/ name=foo$I url=bar
export REMOTE_HREF=$(http $BASE_ADDR/pulp/api/v3/remotes/file/file/ | jq -r '.results[0] | .pulp_href')

# dispatch an update task while redis is down, then bring redis back up
sudo systemctl stop redis
http PATCH :$REMOTE_HREF name=test$I
sudo systemctl start redis

# this task will be stuck forever and all future tasks get stuck too
http PATCH :$REMOTE_HREF name=test

#2 Updated by fao89 7 months ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 70

#3 Updated by rchan 6 months ago

  • Sprint changed from Sprint 70 to Sprint 71

#4 Updated by rchan 6 months ago

  • Sprint changed from Sprint 71 to Sprint 72

#5 Updated by rchan 5 months ago

  • Sprint changed from Sprint 72 to Sprint 73

#6 Updated by dalley 5 months ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley

#7 Updated by pulpbot 5 months ago

  • Status changed from ASSIGNED to POST

#8 Updated by dalley 5 months ago

Pasting this one comment from github for historical record:

I might know what it is. Tasks are spawned onto a queue named after the worker they are assigned to. The worker is determined via _acquire_worker with a database query. When Redis goes down, all the workers get killed off (and recreated), but their records remain in the database for 30 seconds until they get cleaned up in the background.

If this happens quickly enough, a task could be assigned to a dead worker and dumped into that dead worker's queue. That should be easy enough to verify...

This would have been introduced when we started using a random(ish?) number component in the worker names.
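
To make the suspected race concrete, here is a simplified sketch of the dispatch flow described above. It is an illustration, not the actual pulpcore code: the Worker record, _acquire_worker stand-in, and the example worker name are assumptions, and the only point is that nothing checks whether the chosen worker process is still alive before its per-worker RQ queue is used.

# Illustration of the suspected race -- not the actual pulpcore code.
from collections import namedtuple

from redis import Redis
from rq import Queue

# Stand-in for pulpcore's database record of a worker. Records of workers that
# died along with Redis can linger for ~30 seconds before background cleanup.
Worker = namedtuple("Worker", ["name"])


def _acquire_worker(workers_in_db):
    # Stand-in for the database query described above: it returns a worker
    # record without checking whether that worker process is still running.
    return workers_in_db[0]


def dispatch(task_func, workers_in_db, redis_conn):
    worker = _acquire_worker(workers_in_db)
    # The task is enqueued onto an RQ queue named after the assigned worker.
    # If `worker` is a stale record (the real process was killed when Redis
    # went down and came back under a new random-ish name), no process listens
    # on this queue, so the job just sits there and the Pulp task stays
    # "waiting" forever.
    queue = Queue(name=worker.name, connection=redis_conn)
    return queue.enqueue(task_func)


if __name__ == "__main__":
    # Hypothetical stale record left behind by a worker that died with Redis.
    stale = [Worker(name="75273@pulp3.example.com")]
    dispatch(print, stale, Redis())  # lands in a queue nobody consumes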

#9 Updated by rchan 5 months ago

  • Sprint changed from Sprint 73 to Sprint 74

#10 Updated by rchan 4 months ago

  • Sprint changed from Sprint 74 to Sprint 75

#11 Updated by rchan 4 months ago

  • Sprint changed from Sprint 75 to Sprint 76

#12 Updated by hinduparv 4 months ago

  • File As we grew up, my brothers acted like theydidn’t care, but I always knew theylooked out for me and were there!– Happy Raksha Bandhan!.jpg added

#13 Updated by dalley 3 months ago

  • File deleted (As we grew up, my brothers acted like theydidn’t care, but I always knew theylooked out for me and were there!– Happy Raksha Bandhan!.jpg)

#14 Updated by rchan 3 months ago

  • Sprint changed from Sprint 76 to Sprint 77

#15 Updated by dalley 3 months ago

  • Related to Issue #7119: Tasks stay in waiting state if worker that had resource reservation gone added

#16 Updated by rchan 3 months ago

  • Sprint changed from Sprint 77 to Sprint 78

#17 Updated by rchan 2 months ago

  • Sprint changed from Sprint 78 to Sprint 79

#18 Updated by rchan about 2 months ago

  • Sprint changed from Sprint 79 to Sprint 80

#19 Updated by fao89 about 2 months ago

  • Related to Issue #7387: Tasks not delivered to resource-manager are not cleaned up added

#20 Updated by daviddavis about 2 months ago

  • Has duplicate Issue #7387: Tasks not delivered to resource-manager are not cleaned up added

#21 Updated by daviddavis about 2 months ago

  • Priority changed from Normal to High

#22 Updated by daviddavis about 2 months ago

Setting the priority to high since Ansible also reported this.

#23 Updated by rchan about 2 months ago

  • Sprint changed from Sprint 80 to Sprint 81

#25 Updated by dalley about 1 month ago

  • Status changed from POST to MODIFIED

#26 Updated by bmbouter 28 days ago

  • Sprint/Milestone set to 3.7.0

#27 Updated by pulpbot 28 days ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
