Project

Profile

Help

Issue #9247

closed

"Lost" tasks are marked as "canceled", which is misleading/confusing

Added by ggainey about 3 years ago. Updated almost 3 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Sprint:
Sprint 106
Quarter:

Description

Currently, tasks that "disappear" are marked as 'canceled' once the task-system notices a worker has disappeared. This is (in my opinion, anyway) confusing/misleading. These tasks weren't canceled - all we know is, they failed to complete and left the system

Specific examples seen "in the wild" of the kinds of tasks where this happens:

  • Memory-critical system invokes the OOMKiller which kill-9s a running task to free up memory
  • python invokes a C-library which throws an assert, resulting in a SIGABRT that kills the worker
  • Frustrated user kill-9s worker-threads

My contention is these are 'failed' tasks - they just didn't fail via python-exception.

Discussion around this started in email; here is the record as it stands today:

1965942 (depsolving/CV-publish BZ)

  • lots of testing
  • investigating/experimenting ways to notice/catch the assert-failure from libsolv
    • dealing with fatal signals from a C library, in Python, not on the main thread, is...Fraught
  • marking "disappeared" tasks as "canceled" is feeling more and more odd to me
    • should these be marked as "failed"? We have no idea why they disappeared.

Yeah, I agree. I think it is because we are handling cancellation "lazily", i.e. we just kill the process when we want to cancel it, and treat all killed processes as cancelled. There is a special path for Python exceptions in _perform_task() that catches Python failures before they kill the process that way, so it works in that case...

But as you've pointed out, this strategy breaks for non-Python exceptions. I think that's a valid issue to file separately.

I'm unsure, whether I understand what the issue is. Can we consider the task states in a finite state machine and collect all the ways a task can transition from one state to another? Like "requested to be canceled", "raised an exception while executing", "worker gone missing", "worker was restarted". I think the "failed" state is really only used for intrinsic reasons where we can attach a traceback of the exception raised by the task. But i agree that this means we treat "aborted" tasks like "canceled" tasks, and that may not be the most intuitive way.

In this particular instance, a worker is calling a C library that is failing with an assert. This generates a SIGABRT, which kills the worker in a way that we can't notice from python - setting up signal-handlers in Python has to be done from the main thread in the interpreter (at least, that's my understanding, and my experiments appear to bear it out), and that's not where we are in the code that ultimately fails. There's no traceback, just a single line of output in the log from the assert.

"Canceled", to me, implies "active intent" - if something is canceled, it's because someone canceled it. If a thread just disappears, that isn't canceled - that's "Something Bad Happened", and we don't know what. That either deserves its own > state, or should be "failed" - we don't know what happened, all we know it wasn't "completed". Treating this as the same as a deliberate cancel isn't less intuitive - it's actively misleading.

At any rate - I'll open an issue today and get this discussion into it, better to capture in the right place!

So we should add "python interpreter died" to the state transitions.


Related issues

Copied to Pulp - Backport #9453: Backport #9247 "Lost tasks are marked as canceled, which is misleading/confusing" to 3.14.zCLOSED - CURRENTRELEASEdalley

Actions
Actions #1

Updated by fao89 about 3 years ago

  • Triaged changed from No to Yes
Actions #2

Updated by fao89 about 3 years ago

  • Triaged changed from Yes to No
Actions #3

Updated by ggainey about 3 years ago

Discussed during open-floor 2021-08-31 10:32-10:39EDT. Decision was, "mark as 'failed', with a timestamp for when we marked it so, and an error-code of 'worker disappeared'"

We will want to backport any fix to 3.14 as well, for downstream/katello-consumption.

Actions #4

Updated by dkliban@redhat.com about 3 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 104
Actions #5

Updated by rchan about 3 years ago

  • Sprint changed from Sprint 104 to Sprint 105
Actions #6

Updated by pulpbot almost 3 years ago

  • Status changed from NEW to POST
Actions #7

Updated by dalley almost 3 years ago

  • Assignee set to ggainey
  • Sprint/Milestone set to 3.16.0
Actions #8

Updated by dalley almost 3 years ago

  • Assignee deleted (ggainey)
  • Sprint/Milestone deleted (3.16.0)
Actions #9

Updated by ggainey almost 3 years ago

  • Status changed from POST to NEW

Incorrectly linked to wrong PR - fixing.

Actions #10

Updated by rchan almost 3 years ago

  • Sprint changed from Sprint 105 to Sprint 106
Actions #11

Updated by mdellweg almost 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to mdellweg
Actions #12

Updated by pulpbot almost 3 years ago

  • Status changed from ASSIGNED to POST
Actions #13

Updated by dalley almost 3 years ago

  • Sprint/Milestone set to 3.16.0
Actions #14

Updated by dalley almost 3 years ago

  • Copied to Backport #9453: Backport #9247 "Lost tasks are marked as canceled, which is misleading/confusing" to 3.14.z added

Added by mdellweg almost 3 years ago

Revision 6fb91091 | View on GitHub

Mark abandoned tasks as "failed"

fixes #9247

Actions #15

Updated by mdellweg almost 3 years ago

  • Status changed from POST to MODIFIED
Actions #16

Updated by pulpbot almost 3 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Also available in: Atom PDF