Project

Profile

Help

Issue #5502

closed

worker fails to mark a task as failed when critical conditions are encountered

Added by mihai.ibanescu@gmail.com over 4 years ago. Updated almost 3 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
2.15.3
Platform Release:
2.21.5
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 87
Quarter:

Description

In my case, I was syncing a docker repo and I ran out of space on the partition that had /var/cache/pulp/

Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: pulp.server.async.tasks:INFO: [a121ac15] Task failed : [fa40464f-ad88-4f7a-913a-1cbd498e298e] : Worker terminated abnormally while processing task fa40464f-ad88-4f7a-913a-1cbd498e298e.  Check the logs for details
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816) Task pulp.server.async.tasks._release_resource[a121ac15-1698-4628-8322-fa2e3543bdda] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816) Traceback (most recent call last):
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     R = retval = fun(*args, **kwargs)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 107, in __call__
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     return super(PulpTask, self).__call__(*args, **kwargs)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     return self.run(*args, **kwargs)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 296, in _release_resource
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 602, in on_failure
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     if not self.request.called_directly:
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 978, in _get_request
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     req = self.request_stack.top
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816) AttributeError: 'NoneType' object has no attribute 'top'

This looks similar to https://pulp.plan.io/issues/2849 but I've applied that fix already.

I believe the problem with not canceling the whole task (and generating the traceback above) is at https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L376 (while I'm on 2.15.3, the bug seems to still be in the 2-master branch).

new_task was never bound to an app, so its request_stack was never initialized.

If that is indeed the case, maybe at the same time we can also make the inner class at line 373 be moved higher so it doesn't have to be re-generated with every call - although in this case, if you're in this code path, things are kind of bad anyway).

Actions #1

Updated by fao89 over 4 years ago

  • Sprint/Milestone set to 2.20.0
  • Triaged changed from No to Yes
Actions #2

Updated by dalley over 4 years ago

I'm pretty sure this broke after we moved to Celery 4.0.2: https://pulp.plan.io/issues/2990

I think my judgement at the time about whether that issue was likely to present a problem was flawed. It should probably be fixed. Would you be willing to look into it, Mihai?

Actions #3

Updated by dkliban@redhat.com over 3 years ago

  • Status changed from NEW to CLOSED - WONTFIX
  • Tags Pulp 2 added

Pulp 2 is in maintenance mode.

Actions #4

Updated by bmbouter over 3 years ago

  • Status changed from CLOSED - WONTFIX to NEW
  • Sprint/Milestone deleted (2.20.0)
Actions #5

Updated by bmbouter over 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
  • Triaged changed from Yes to No
Actions #6

Updated by bmbouter over 3 years ago

untriaging so we can put on the sprint at triage

Actions #7

Updated by fao89 over 3 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 86
Actions #8

Updated by bmbouter over 3 years ago

  • Status changed from ASSIGNED to POST
Actions #9

Updated by rchan over 3 years ago

  • Sprint changed from Sprint 86 to Sprint 87
Actions #10

Updated by ipanova@redhat.com over 3 years ago

  • Platform Release set to 2.21.5

Added by bmbouter over 3 years ago

Revision 8e69ab19 | View on GitHub

Adds task failure handling for celery in bad state

Sometimes celery's internal state is in such a bad position it causes the on_failure handler itself to fail. This is very bad for Pulp as it causes the tasking system to stop processing tasks due to on_failure not releasing locks like it should.

This adds a defensive codepath for that situation which maintains the current functionality, but if Celery is uncooperative it still performs the on_failure cleanup.

Additionally, it logs the error in all cases. One of the aspects of this issue is that due to Celery having a bad state it won't correctly log the actual root cause triggering the on_failure handler.

fixes #5502

Actions #11

Updated by bmbouter over 3 years ago

  • Status changed from POST to MODIFIED

Added by bmbouter about 3 years ago

Revision ef85981e | View on GitHub

Adds task failure handling for celery in bad state

Sometimes celery's internal state is in such a bad position it causes the on_failure handler itself to fail. This is very bad for Pulp as it causes the tasking system to stop processing tasks due to on_failure not releasing locks like it should.

This adds a defensive codepath for that situation which maintains the current functionality, but if Celery is uncooperative it still performs the on_failure cleanup.

Additionally, it logs the error in all cases. One of the aspects of this issue is that due to Celery having a bad state it won't correctly log the actual root cause triggering the on_failure handler.

fixes #5502

(cherry picked from commit 8e69ab197f68b0b8a701172d8057d360c8a72dc0)

Actions #12

Updated by ggainey about 3 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

Also available in: Atom PDF