Issue #6449 (closed)
Tasks stuck in Waiting state
Description
We have been seeing many waiting tasks, and they seem to be stuck forever. For example:
./get /pulp/api/v3/tasks/14b76b27-9f34-4297-88ed-5ec13cbe5e50/
HTTP/1.1 200 OK
Allow: GET, PATCH, DELETE, HEAD, OPTIONS
Connection: keep-alive
Content-Length: 323
Content-Type: application/json
Date: Fri, 03 Apr 2020 12:56:02 GMT
Server: nginx/1.16.1
Vary: Accept, Cookie
X-Frame-Options: SAMEORIGIN

{
    "created_resources": [],
    "error": null,
    "finished_at": null,
    "name": "pulpcore.app.tasks.base.general_update",
    "progress_reports": [],
    "pulp_created": "2020-04-02T13:00:14.881212Z",
    "pulp_href": "/pulp/api/v3/tasks/14b76b27-9f34-4297-88ed-5ec13cbe5e50/",
    "reserved_resources_record": [],
    "started_at": null,
    "state": "waiting",
    "worker": null
}
Per Brian: "So the problematic thing I see in this output is the 'resource-manager | 0'. This tells me that Pulp's record of the task is in PostgreSQL (and was never run), but RQ has lost the task from the 'resource-manager' queue in Redis. So the next question is: how did that happen?"
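The "resource-manager | 0" Brian refers to above appears to be a queue count reported by RQ. As a hedged sketch (not part of the issue), the same comparison can be made from Python; the Redis URL is an assumption to adjust for your deployment:

# Sketch only: compare what RQ still holds in Redis for the resource-manager
# queue against the Pulp task record shown above.
from redis import Redis
from rq import Queue

conn = Redis.from_url("redis://localhost:6379/0")  # assumed Redis location
q = Queue("resource-manager", connection=conn)

# A task stuck in "waiting" in /pulp/api/v3/tasks/ with no matching job here
# is the mismatch Brian describes: the database row exists, but RQ has lost
# the job from the queue.
print(f"resource-manager | {len(q)}")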
Related issues
Updated by daviddavis over 4 years ago
This seems to produce stuck waiting tasks for me:
export I=$RANDOM
http POST $BASE_ADDR/pulp/api/v3/remotes/file/file/ name=foo$I url=bar
export REMOTE_HREF=$(http $BASE_ADDR/pulp/api/v3/remotes/file/file/ | jq -r '.results[0] | .pulp_href')
sudo systemctl stop redis
http PATCH :$REMOTE_HREF name=test$I
sudo systemctl start redis
# this task will be stuck forever and all future tasks get stuck too
http PATCH :$REMOTE_HREF name=test
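To see what "stuck" looks like, the task spawned by that final PATCH can be polled; a minimal sketch, assuming a local API with default admin credentials and substituting the task href that the PATCH returns (the href below reuses the example from the description):

# Polling sketch (not from the issue). BASE_ADDR, TASK_HREF, and the
# credentials are assumptions.
import time
import requests

BASE_ADDR = "http://localhost:24817"
TASK_HREF = "/pulp/api/v3/tasks/14b76b27-9f34-4297-88ed-5ec13cbe5e50/"

while True:
    task = requests.get(BASE_ADDR + TASK_HREF, auth=("admin", "password")).json()
    print(task["state"], task["worker"])
    if task["state"] not in ("waiting", "running"):
        break
    # A stuck task reports state "waiting" and worker null indefinitely.
    time.sleep(5)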
Updated by fao89 over 4 years ago
- Triaged changed from No to Yes
- Sprint set to Sprint 70
Updated by dalley over 4 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dalley
Updated by pulpbot over 4 years ago
- Status changed from ASSIGNED to POST
Updated by dalley over 4 years ago
Pasting this one comment from GitHub for the historical record:
I might know what it is. Tasks are spawned onto a queue named after the worker they are assigned to. That worker is determined via _acquire_worker() with a database query. When Redis goes down, all the workers get killed off (and recreated), but their records remain in the database for 30 seconds until they get cleaned up in the background.
If that happens quickly enough, a task could be assigned to a dead worker and dumped into that dead worker's queue. That should be easy enough to verify...
This would have been introduced when we started using a random(ish?) number component in the worker names.
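As a rough, self-contained illustration of that race (not Pulp's actual code; the record fields, worker names, and handling of the 30-second window are assumptions):

# Sketch of the race: a worker killed by the Redis outage still looks
# eligible for assignment until its database record is cleaned up.
from dataclasses import dataclass
from datetime import datetime, timedelta

CLEANUP_WINDOW = timedelta(seconds=30)  # dead worker rows linger this long

@dataclass
class WorkerRecord:
    name: str                 # per-worker queue name, includes a random(ish) suffix
    last_heartbeat: datetime
    alive: bool               # whether a worker process is actually consuming the queue

def _acquire_worker(workers):
    # Selection is a heartbeat query, not a liveness check: a freshly dead
    # worker still passes for up to CLEANUP_WINDOW.
    now = datetime.now()
    for w in workers:
        if now - w.last_heartbeat < CLEANUP_WINDOW:
            return w
    return None

# A worker that died in the Redis restart, plus its replacement with a new suffix.
dead = WorkerRecord("reserved-resource-worker-1@pulp.4821", datetime.now(), alive=False)
fresh = WorkerRecord("reserved-resource-worker-1@pulp.9377", datetime.now(), alive=True)

chosen = _acquire_worker([dead, fresh])
# The task would be enqueued onto a queue named chosen.name; nothing consumes
# the dead worker's queue anymore, so the task sits in "waiting" forever.
print(chosen.name, "alive:", chosen.alive)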
Updated by hinduparv over 4 years ago
- File As we grew up, my brothers acted like theydidn’t care, but I always knew theylooked out for me and were there!– Happy Raksha Bandhan!.jpg added
Updated by dalley over 4 years ago
- File deleted (As we grew up, my brothers acted like theydidn’t care, but I always knew theylooked out for me and were there!– Happy Raksha Bandhan!.jpg)
Updated by dalley over 4 years ago
- Related to Issue #7119: Tasks stay in waiting state if worker that had resource reservation gone added
Updated by fao89 about 4 years ago
- Related to Issue #7387: Tasks not delivered to resource-manager are not cleaned up added
Updated by daviddavis about 4 years ago
- Has duplicate Issue #7387: Tasks not delivered to resource-manager are not cleaned up added
Updated by daviddavis about 4 years ago
Setting the priority to high since Ansible also reported this.
Updated by pulpbot about 4 years ago
Added by dalley about 4 years ago
Fix tasks stuck in waiting state on redis connection failure
closes: #6449 https://pulp.plan.io/issues/6449
Updated by dalley about 4 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulpcore|fd986b4e6e25d03274d54f8d564af0d61c892492.
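The changeset itself is not quoted in this thread; purely as a hypothetical illustration of the kind of guard its title suggests (do not leave a task in "waiting" when Redis cannot accept the job at dispatch time), a sketch with assumed function and helper names, not the actual patch:

# Hypothetical sketch only -- not the contents of changeset fd986b4e.
# "task" stands in for Pulp's task record; set_failed() is an assumed helper.
import redis
from rq import Queue

def dispatch(task, queue_name, connection):
    try:
        Queue(queue_name, connection=connection).enqueue(task.run)
    except redis.exceptions.ConnectionError as exc:
        # Surface the failure on the task record instead of leaving an
        # orphaned "waiting" row with no RQ job behind it.
        task.set_failed(exc)
        raise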
Updated by pulpbot about 4 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE