Issue #8779

Issue #8912: [EPIC] Issues with the traditional tasking system

Task started on removed worker

Added by adam.winberg@smhi.se 2 months ago. Updated 7 days ago.

Status: MODIFIED
Priority: Normal
Assignee:
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Katello
Sprint: Sprint 101
Quarter:

Description

After a postgres outage, a couple of tasks were started on workers that, according to the logs, had already been removed. The tasks then got stuck in a 'waiting' state, and I had to cancel them to make them go away.

Logs showing worker being removed:

May 19 10:56:34 lxserv2285 rq[589125]: pulp [None]: pulpcore.tasking.worker_watcher:ERROR: Worker '2961917@lxserv2285' has gone missing, removing from list of workers
May 19 10:56:34 lxserv2285 rq[589125]: pulp [None]: pulpcore.tasking.worker_watcher:ERROR: The worker named 2961917@lxserv2285 is missing. Canceling the tasks in its queue.

Task being started after removal of workers (snippet):

{
  "pulp_created": "2021-05-19T10:57:17.819927Z",
  "state": "waiting",
  "worker": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/",
  }

The worker above is the worker that had been removed.

Any idea why the task was started on a worker that should have been removed from the list of workers?

On RHEL8, with python3-pulpcore-3.11.0-1.el8.noarch


Related issues

Related to Pulp - Backport #9118: Backport #8779 "Task started on removed worker" fix to 3.7 CLOSED - CURRENTRELEASE

Copied to Pulp - Backport #9116: Backport #8779 "Task started on removed worker" to 3.14.z CLOSED - CURRENTRELEASE


Associated revisions

Revision 0cfaa8e7
Added by mdellweg 7 days ago

Prevent tasks being assigned to missing workers

fixes #8779

History

#1 Updated by adam.winberg@smhi.se 2 months ago

Sorry, the task output was not complete. Here is the full output:

{
    "child_tasks": [],
    "created_resources": [],
    "error": null,
    "finished_at": null,
    "logging_cid": "565b8b02bed8467997bc7d1f1a7440e4",
    "name": "pulp_rpm.app.tasks.synchronizing.synchronize",
    "parent_task": null,
    "progress_reports": [],
    "pulp_created": "2021-05-19T10:57:23.076422Z",
    "pulp_href": "/pulp/api/v3/tasks/492ad82e-a55f-46fe-94ad-e017f378d162/",
    "reserved_resources_record": [
        "/pulp/api/v3/remotes/rpm/rpm/3e276c7e-dbf7-4860-b6b1-965dfd188039/",
        "/pulp/api/v3/repositories/rpm/rpm/bed5516d-50ab-4c20-889b-747470e88551/"
    ],
    "started_at": null,
    "state": "waiting",
    "task_group": null,
    "worker": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/"
}

And also the worker details.

{
    "last_heartbeat": "2021-05-19T10:56:34.675050Z",
    "name": "2961917@lxserv2285.smhi.se",
    "pulp_created": "2021-05-03T05:43:18.304676Z",
    "pulp_href": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/"
}

So, since this worker was removed at 10:56:34 how could a task be assigned to it at 10:57:23?
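To make the timing concrete, here is a minimal sketch that compares the two timestamps from the output above. The helper names (`parse_pulp_timestamp`, `seconds_after_last_heartbeat`) are made up for illustration and are not part of pulpcore; the dictionaries only carry the fields copied from the task and worker details.

```python
from datetime import datetime, timezone


def parse_pulp_timestamp(value):
    """Parse a timestamp such as '2021-05-19T10:57:23.076422Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_after_last_heartbeat(task, worker):
    """Return how many seconds after the worker's last heartbeat the task
    was created, or None if the task is not assigned to that worker.
    (Hypothetical helper, for illustration only.)"""
    if task["worker"] != worker["pulp_href"]:
        return None
    created = parse_pulp_timestamp(task["pulp_created"])
    heartbeat = parse_pulp_timestamp(worker["last_heartbeat"])
    return (created - heartbeat).total_seconds()


# Values copied from the task and worker output in this comment.
task = {
    "pulp_created": "2021-05-19T10:57:23.076422Z",
    "state": "waiting",
    "worker": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/",
}
worker = {
    "last_heartbeat": "2021-05-19T10:56:34.675050Z",
    "pulp_href": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/",
}

gap = seconds_after_last_heartbeat(task, worker)
print(f"Task created {gap:.1f} s after the worker's last heartbeat")
```

A positive gap for a task that is still in 'waiting' is the symptom described here: the task was created after the worker had stopped reporting.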

#3 Updated by ttereshc about 2 months ago

  • Tags Katello added

#4 Updated by dkliban@redhat.com about 2 months ago

  • Priority changed from Normal to High
  • Triaged changed from No to Yes
  • Sprint set to Sprint 98

#5 Updated by dalley about 2 months ago

  • Severity changed from 2. Medium to 3. High

#6 Updated by mdellweg about 1 month ago

  • Parent task set to #8912

#7 Updated by rchan about 1 month ago

  • Sprint changed from Sprint 98 to Sprint 99

#8 Updated by rchan 27 days ago

  • Sprint changed from Sprint 99 to Sprint 100

#9 Updated by bmbouter 23 days ago

  • Status changed from NEW to CLOSED - WONTFIX

We recommend that users upgrade to the new tasking system to resolve this issue. If you are not able to do that, please comment here and explain why we should reopen the issue.

#10 Updated by ggainey 16 days ago

  • Status changed from CLOSED - WONTFIX to NEW

Reopening - this is considered a blocker for katello/3.18 upgrade-to-pulp3 process, as it has been causing the 2to3 tasks to hang forever. See 1975858 for more details.

#11 Updated by ggainey 15 days ago

  • Priority changed from High to Normal
  • Severity changed from 3. High to 2. Medium

ggainey wrote:

Reopening - this is considered a blocker for katello/3.18 upgrade-to-pulp3 process, as it has been causing the 2to3 tasks to hang forever. See 1975858 for more details.

After discussion with katello, here's where we are on this:

  • the current "concerning" reports are only from users experimenting with Pulp3.7/katello3.18, post-2to3-migration
  • this is not ever going to be a supported environment for major Pulp3 work
    • current sequence is pulp3.7/katello3.18/pulp2.21.5, migrate your data using 2to3 migration, upgrade to pulp3.14/katello4.1/no-pulp2 and then do Complicated Stuff
  • if this problem occurs during 2to3 migration, there is a workaround
    • cancel the task(s) and restart migration
    • disable parallel-processing of 2to3 if needed

As a result, the priority on this goes way down - this may not be worth fixing for 3.7, and is not a problem in 3.14.
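For the "cancel the task(s)" workaround above, here is a rough sketch of the request involved. It assumes the Pulp 3 REST API accepts a PATCH on the task href with a `state` of `canceled`; please check the API docs for your pulpcore version. The host name is a placeholder, authentication is left out, and `build_cancel_request` is a hypothetical helper. The sketch only builds the request and does not send it.

```python
import json
from urllib.request import Request


def build_cancel_request(base_url, task_href):
    """Build (but do not send) a PATCH request asking Pulp to cancel a task.
    Hypothetical helper; authentication headers are intentionally omitted."""
    body = json.dumps({"state": "canceled"}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + task_href,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host; the task href is the stuck task reported in this issue.
request = build_cancel_request(
    "https://pulp.example.com",
    "/pulp/api/v3/tasks/492ad82e-a55f-46fe-94ad-e017f378d162/",
)
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` with whatever credentials your installation requires.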

#12 Updated by rchan 14 days ago

  • Sprint changed from Sprint 100 to Sprint 101

#13 Updated by pulpbot 10 days ago

  • Status changed from NEW to POST

#14 Updated by mdellweg 9 days ago

  • Copied to Backport #9116: Backport #8779 "Task started on removed worker" to 3.14.z added

#15 Updated by ttereshc 9 days ago

  • Related to Backport #9118: Backport #8779 "Task started on removed worker" fix to 3.7 added

#17 Updated by mdellweg 7 days ago

  • Assignee set to mdellweg

#18 Updated by mdellweg 7 days ago

  • Status changed from POST to MODIFIED
