Issue #8779: Task started on removed worker (Closed)
Description
After a postgres outage, a couple of tasks were started on workers which, according to the logs, had already been removed. The tasks then got stuck in a 'waiting' state and I had to cancel them to make them go away.
Logs showing worker being removed:
May 19 10:56:34 lxserv2285 rq[589125]: pulp [None]: pulpcore.tasking.worker_watcher:ERROR: Worker '2961917@lxserv2285' has gone missing, removing from list of workers
May 19 10:56:34 lxserv2285 rq[589125]: pulp [None]: pulpcore.tasking.worker_watcher:ERROR: The worker named 2961917@lxserv2285 is missing. Canceling the tasks in its queue.
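For context, that removal is heartbeat-driven: a worker whose last_heartbeat is older than a timeout gets treated as missing. The snippet below is only an illustrative sketch of that pattern, not pulpcore's actual code, and the 30-second timeout is an assumed value:
# Illustrative sketch only, not pulpcore's implementation: a heartbeat-based
# watcher treats any worker whose last heartbeat is older than a timeout as missing.
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(seconds=30)  # assumed value for illustration

def find_missing_workers(workers, now=None):
    """Return workers whose last heartbeat is older than HEARTBEAT_TIMEOUT."""
    now = now or datetime.now(timezone.utc)
    return [w for w in workers if now - w["last_heartbeat"] > HEARTBEAT_TIMEOUT]

# The worker from the log above would be flagged once its heartbeat goes stale:
workers = [{"name": "2961917@lxserv2285",
            "last_heartbeat": datetime(2021, 5, 19, 10, 55, 55, tzinfo=timezone.utc)}]
print(find_missing_workers(workers, now=datetime(2021, 5, 19, 10, 56, 34, tzinfo=timezone.utc)))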
Task being started after removal of workers (snippet):
{ "pulp_created": "2021-05-19T10:57:17.819927Z", "state": "waiting",
"worker": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/",
}
The worker referenced above is the one that had just been removed.
Any idea why the task was started on a worker that should have been removed from the list of workers?
This is on RHEL 8, with python3-pulpcore-3.11.0-1.el8.noarch.
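To find any other tasks in the same state, one option is to list tasks with state=waiting and check the last_heartbeat of the worker each one is assigned to. A minimal sketch against the REST API (the base URL and credentials are placeholders):
# Hedged helper sketch (not part of pulpcore): list waiting tasks and print the
# last_heartbeat of the worker each one is assigned to.
import requests

BASE = "https://pulp.example.com"   # placeholder
AUTH = ("admin", "password")        # placeholder

resp = requests.get(f"{BASE}/pulp/api/v3/tasks/", params={"state": "waiting"}, auth=AUTH)
resp.raise_for_status()
for task in resp.json()["results"]:
    worker_href = task.get("worker")
    if not worker_href:
        continue
    worker = requests.get(f"{BASE}{worker_href}", auth=AUTH).json()
    print(task["pulp_href"], "->", worker.get("name"), worker.get("last_heartbeat"))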
Related issues
- Issue #8912: [EPIC] Issues with the traditional tasking system
Updated by adam.winberg@smhi.se over 3 years ago
Sorry, the task output above was not complete. Here is the full task record:
{
    "child_tasks": [],
    "created_resources": [],
    "error": null,
    "finished_at": null,
    "logging_cid": "565b8b02bed8467997bc7d1f1a7440e4",
    "name": "pulp_rpm.app.tasks.synchronizing.synchronize",
    "parent_task": null,
    "progress_reports": [],
    "pulp_created": "2021-05-19T10:57:23.076422Z",
    "pulp_href": "/pulp/api/v3/tasks/492ad82e-a55f-46fe-94ad-e017f378d162/",
    "reserved_resources_record": [
        "/pulp/api/v3/remotes/rpm/rpm/3e276c7e-dbf7-4860-b6b1-965dfd188039/",
        "/pulp/api/v3/repositories/rpm/rpm/bed5516d-50ab-4c20-889b-747470e88551/"
    ],
    "started_at": null,
    "state": "waiting",
    "task_group": null,
    "worker": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/"
}
And here are the worker details:
{
    "last_heartbeat": "2021-05-19T10:56:34.675050Z",
    "name": "2961917@lxserv2285.smhi.se",
    "pulp_created": "2021-05-03T05:43:18.304676Z",
    "pulp_href": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/"
}
So, since this worker was removed at 10:56:34, how could a task be assigned to it at 10:57:23?
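Without digging into the actual scheduler code, the shape of the race would be something like the sketch below (illustrative only, not pulpcore's implementation): if the worker record is still visible to task assignment and the pick is not re-checked against a fresh heartbeat, a task created after the removal log line can still be pointed at that worker; guarding the pick at assignment time avoids it.
# Illustrative only, not pulpcore's implementation.
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(seconds=30)  # assumed value

def pick_worker_naive(workers):
    # Bug pattern: returns any known worker, even one that stopped heartbeating.
    return workers[0] if workers else None

def pick_worker_guarded(workers):
    # Safer pattern: only consider workers with a fresh heartbeat at assignment time.
    now = datetime.now(timezone.utc)
    alive = [w for w in workers if now - w["last_heartbeat"] <= HEARTBEAT_TIMEOUT]
    return alive[0] if alive else None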
Updated by dkliban@redhat.com over 3 years ago
- Priority changed from Normal to High
- Triaged changed from No to Yes
- Sprint set to Sprint 98
Updated by bmbouter over 3 years ago
- Status changed from NEW to CLOSED - WONTFIX
We recommend upgrading to the new tasking system as the resolution to this issue. If you are not able to do that, please comment here with information on why we should reopen the issue.
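For reference, on releases that ship both systems, opting in is a settings change. A minimal sketch, assuming the opt-in flag is named USE_NEW_WORKER_TYPE as on the 3.13/3.14 line (verify against the release notes for your version):
# /etc/pulp/settings.py -- sketch only; the flag name is an assumption based on
# the 3.13/3.14 line, check your version's release notes before relying on it.
USE_NEW_WORKER_TYPE = True
The pulpcore services need to be restarted after changing settings.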
Updated by ggainey over 3 years ago
- Status changed from CLOSED - WONTFIX to NEW
Reopening - this is considered a blocker for the katello/3.18 upgrade-to-pulp3 process, as it has been causing the 2to3 tasks to hang forever. See 1975858 for more details.
Updated by ggainey over 3 years ago
- Priority changed from High to Normal
- Severity changed from 3. High to 2. Medium
ggainey wrote:
Reopening - this is considered a blocker for the katello/3.18 upgrade-to-pulp3 process, as it has been causing the 2to3 tasks to hang forever. See 1975858 for more details.
After discussion with katello, here's where we are on this:
- the current "concerning" reports are only from users experimenting with Pulp3.7/katello3.18, post-2to3-migration
- this is never going to be a supported environment for major Pulp3 work
- the current sequence is: pulp3.7/katello3.18/pulp2.21.5, migrate your data using the 2to3 migration, upgrade to pulp3.14/katello4.1/no-pulp2, and then do Complicated Stuff
- if this problem occurs during the 2to3 migration, there is a workaround (see the sketch after this comment):
  - cancel the task(s) and restart the migration
  - disable parallel processing of 2to3 if needed
As a result, the priority on this goes way down - this may not be worth fixing for 3.7, and is not a problem in 3.14.
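For the cancel-and-restart part of the workaround, canceling a task is done by PATCHing its href with state=canceled. A minimal sketch (base URL and credentials are placeholders; the UUID is the stuck task from the report above):
# Sketch of the cancel workaround over the REST API.
import requests

BASE = "https://pulp.example.com"   # placeholder
AUTH = ("admin", "password")        # placeholder
task_href = "/pulp/api/v3/tasks/492ad82e-a55f-46fe-94ad-e017f378d162/"

resp = requests.patch(f"{BASE}{task_href}", json={"state": "canceled"}, auth=AUTH)
resp.raise_for_status()
print(resp.json()["state"])  # typically "canceled" or "canceling"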
Updated by pulpbot over 3 years ago
- Status changed from NEW to POST
Updated by mdellweg over 3 years ago
- Copied to Backport #9116: Backport #8779 "Task started on removed worker" to 3.14.z added
Updated by ttereshc over 3 years ago
- Related to Backport #9118: Backport #8779 "Task started on removed worker" fix to 3.7 added
Added by mdellweg over 3 years ago
Revision 0cfaa8e7: Prevent tasks being assigned to missing workers (fixes #8779)
Updated by mdellweg over 3 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulpcore|0cfaa8e7433b7cd272631a6f51b9f4a7b10224a7.
Updated by pulpbot about 3 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE