Issue #8779

closed

Issue #8912: [EPIC] Issues with the traditional tasking system

Task started on removed worker

Added by adam.winberg@smhi.se over 3 years ago. Updated about 3 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee:
Category: -
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Katello
Sprint: Sprint 101
Quarter:

Description

After a PostgreSQL outage, a couple of tasks were started on workers that, according to the logs, had already been removed. The tasks then got stuck in the 'waiting' state, and I had to cancel them to make them go away.

Logs showing worker being removed:

May 19 10:56:34 lxserv2285 rq[589125]: pulp [None]: pulpcore.tasking.worker_watcher:ERROR: Worker '2961917@lxserv2285' has gone missing, removing from list of workers
May 19 10:56:34 lxserv2285 rq[589125]: pulp [None]: pulpcore.tasking.worker_watcher:ERROR: The worker named 2961917@lxserv2285 is missing. Canceling the tasks in its queue.

Task being started after removal of workers (snippet):

{
  "pulp_created": "2021-05-19T10:57:17.819927Z",
  "state": "waiting",
  "worker": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/"
}

The worker above is the worker that had been removed.

Any idea why the task was started on a worker that should have been removed from the list of workers?

Environment: RHEL 8 with python3-pulpcore-3.11.0-1.el8.noarch.


Related issues

Related to Pulp - Backport #9118: Backport #8779 "Task started on removed worker" fix to 3.7 (CLOSED - CURRENTRELEASE, mdellweg)

Copied to Pulp - Backport #9116: Backport #8779 "Task started on removed worker" to 3.14.z (CLOSED - CURRENTRELEASE, mdellweg)

Actions #1

Updated by adam.winberg@smhi.se over 3 years ago

Sorry, the task output in the description was not complete. Here is the full output:

{
    "child_tasks": [],
    "created_resources": [],
    "error": null,
    "finished_at": null,
    "logging_cid": "565b8b02bed8467997bc7d1f1a7440e4",
    "name": "pulp_rpm.app.tasks.synchronizing.synchronize",
    "parent_task": null,
    "progress_reports": [],
    "pulp_created": "2021-05-19T10:57:23.076422Z",
    "pulp_href": "/pulp/api/v3/tasks/492ad82e-a55f-46fe-94ad-e017f378d162/",
    "reserved_resources_record": [
        "/pulp/api/v3/remotes/rpm/rpm/3e276c7e-dbf7-4860-b6b1-965dfd188039/",
        "/pulp/api/v3/repositories/rpm/rpm/bed5516d-50ab-4c20-889b-747470e88551/"
    ],
    "started_at": null,
    "state": "waiting",
    "task_group": null,
    "worker": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/"
}

And here are the worker details:

{
    "last_heartbeat": "2021-05-19T10:56:34.675050Z",
    "name": "2961917@lxserv2285.smhi.se",
    "pulp_created": "2021-05-03T05:43:18.304676Z",
    "pulp_href": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/"
}

So, since this worker was removed at 10:56:34, how could a task be assigned to it at 10:57:23?
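
The timing gap can be checked directly from the two JSON payloads above. This is a minimal sketch, assuming Python 3.7+ and using only the timestamps quoted in this issue. The helper names (`parse_ts`, `assigned_after_last_heartbeat`) are made up for illustration and are not part of pulpcore.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse a Pulp-style UTC timestamp such as '2021-05-19T10:56:34.675050Z'."""
    # Replace the trailing 'Z' with an explicit offset so that
    # datetime.fromisoformat() accepts it on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def assigned_after_last_heartbeat(task: dict, worker: dict) -> bool:
    """Return True if the task was created after the worker's last heartbeat."""
    return parse_ts(task["pulp_created"]) > parse_ts(worker["last_heartbeat"])


# Values copied from the task and worker details in this issue.
task = {
    "pulp_created": "2021-05-19T10:57:23.076422Z",
    "state": "waiting",
    "worker": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/",
}
worker = {
    "last_heartbeat": "2021-05-19T10:56:34.675050Z",
    "pulp_href": "/pulp/api/v3/workers/4d159eb5-01e4-4750-a921-c5b28c411e4a/",
}

# Time between the worker's last heartbeat and the task's creation.
gap: timedelta = parse_ts(task["pulp_created"]) - parse_ts(worker["last_heartbeat"])

if __name__ == "__main__":
    print("same worker:", task["worker"] == worker["pulp_href"])
    print("assigned after last heartbeat:", assigned_after_last_heartbeat(task, worker))
    print("gap in seconds:", gap.total_seconds())
```

With the values from this report, the task was created roughly 48 seconds after the worker's last recorded heartbeat, which matches the 10:56:34 removal message in the logs.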

Actions #3

Updated by ttereshc over 3 years ago

  • Tags Katello added
Actions #4

Updated by dkliban@redhat.com over 3 years ago

  • Priority changed from Normal to High
  • Triaged changed from No to Yes
  • Sprint set to Sprint 98
Actions #5

Updated by dalley over 3 years ago

  • Severity changed from 2. Medium to 3. High
Actions #6

Updated by mdellweg over 3 years ago

  • Parent issue set to #8912
Actions #7

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 98 to Sprint 99
Actions #8

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 99 to Sprint 100
Actions #9

Updated by bmbouter over 3 years ago

  • Status changed from NEW to CLOSED - WONTFIX

We recommend that users upgrade to the new tasking system as the resolution to this issue. If you are not able to do that, please comment here explaining why we should reopen the issue.

Actions #10

Updated by ggainey over 3 years ago

  • Status changed from CLOSED - WONTFIX to NEW

Reopening - this is considered a blocker for the katello/3.18 upgrade-to-pulp3 process, as it has been causing the 2to3 tasks to hang forever. See 1975858 for more details.

Actions #11

Updated by ggainey over 3 years ago

  • Priority changed from High to Normal
  • Severity changed from 3. High to 2. Medium

ggainey wrote:

Reopening - this is considered a blocker for the katello/3.18 upgrade-to-pulp3 process, as it has been causing the 2to3 tasks to hang forever. See 1975858 for more details.

After discussion with katello, here's where we are on this:

  • the current "concerning" reports are only from users experimenting with Pulp3.7/katello3.18, post-2to3-migration
  • this is not ever going to be a supported environment for major Pulp3 work
    • the current sequence is: run pulp3.7/katello3.18/pulp2.21.5, migrate your data using the 2to3 migration, upgrade to pulp3.14/katello4.1/no-pulp2, and then do Complicated Stuff
  • if this problem occurs during 2to3 migration, there is a workaround
    • cancel the task(s) and restart migration
    • disable parallel-processing of 2to3 if needed

As a result, the priority on this goes way down - this may not be worth fixing for 3.7, and is not a problem in 3.14.
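
For the "cancel the task(s)" step of the workaround, a stuck task can be canceled through the Pulp 3 REST API by sending a PATCH with `{"state": "canceled"}` to the task's href. The sketch below only builds the request and does not send it. The base URL `https://pulp.example.com` and the helper name `build_cancel_request` are placeholders chosen for illustration, authentication is omitted, and Katello users may prefer to cancel through their own tooling instead.

```python
import json
import urllib.request


def build_cancel_request(base_url: str, task_href: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that asks Pulp to cancel a task."""
    body = json.dumps({"state": "canceled"}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + task_href,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Task href taken from this issue; the host name is a placeholder.
request = build_cancel_request(
    "https://pulp.example.com",
    "/pulp/api/v3/tasks/492ad82e-a55f-46fe-94ad-e017f378d162/",
)

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
    # To actually send it, add credentials and call:
    #   urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to review which task will be canceled before anything is changed on the server.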

Actions #12

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 100 to Sprint 101
Actions #13

Updated by pulpbot over 3 years ago

  • Status changed from NEW to POST
Actions #14

Updated by mdellweg over 3 years ago

  • Copied to Backport #9116: Backport #8779 "Task started on removed worker" to 3.14.z added
Actions #15

Updated by ttereshc over 3 years ago

  • Related to Backport #9118: Backport #8779 "Task started on removed worker" fix to 3.7 added
Actions #17

Updated by mdellweg over 3 years ago

  • Assignee set to mdellweg

Added by mdellweg over 3 years ago

Revision 0cfaa8e7 | View on GitHub

Prevent tasks being assigned to missing workers

fixes #8779

Actions #18

Updated by mdellweg over 3 years ago

  • Status changed from POST to MODIFIED
Actions #19

Updated by pulpbot about 3 years ago

  • Sprint/Milestone set to 3.15.0
Actions #20

Updated by pulpbot about 3 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
