
Issue #5502

worker fails to mark a task as failed when critical conditions are encountered

Added by mihai.ibanescu@gmail.com about 3 years ago. Updated over 1 year ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date: -
Due date: -
Estimated time: -
Severity: 2. Medium
Version: 2.15.3
Platform Release: 2.21.5
OS: -
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint: Sprint 87
Quarter: -

Description

In my case, I was syncing a Docker repo when I ran out of space on the partition that holds /var/cache/pulp/.

Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: pulp.server.async.tasks:INFO: [a121ac15] Task failed : [fa40464f-ad88-4f7a-913a-1cbd498e298e] : Worker terminated abnormally while processing task fa40464f-ad88-4f7a-913a-1cbd498e298e.  Check the logs for details
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816) Task pulp.server.async.tasks._release_resource[a121ac15-1698-4628-8322-fa2e3543bdda] raised unexpected: AttributeError("'NoneType' object has no attribute 'top'",)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816) Traceback (most recent call last):
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     R = retval = fun(*args, **kwargs)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 107, in __call__
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     return super(PulpTask, self).__call__(*args, **kwargs)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     return self.run(*args, **kwargs)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 296, in _release_resource
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/pulp/server/async/tasks.py", line 602, in on_failure
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     if not self.request.called_directly:
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 978, in _get_request
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816)     req = self.request_stack.top
Sep 25 15:54:31 pulptest11.unx.sas.com pulp[13366]: celery.app.trace:ERROR: [a121ac15] (13366-14816) AttributeError: 'NoneType' object has no attribute 'top'

This looks similar to https://pulp.plan.io/issues/2849 but I've applied that fix already.

I believe the problem (the whole task not being canceled, and the traceback above being generated) is at https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/tasks.py#L376 (while I'm on 2.15.3, the bug appears to still be present in the 2-master branch).

new_task was never bound to an app, so its request_stack was never initialized.
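The failure mode can be sketched without celery at all. In the sketch below, `_RequestStack` and `StubTask` are hypothetical stand-ins that only mimic how celery's `Task.request` property dereferences `request_stack.top`; they are not celery's actual classes.

```python
# Minimal stand-in sketch of the failure mode described above.
# _RequestStack and StubTask are hypothetical stand-ins, not celery code.

class _RequestStack:
    """Tiny stack; celery sets one up when a task is bound to an app."""
    def __init__(self):
        self._items = []

    @property
    def top(self):
        return self._items[-1] if self._items else None


class StubTask:
    def __init__(self, bound_to_app):
        # An unbound task never gets a request stack, leaving it None.
        self.request_stack = _RequestStack() if bound_to_app else None

    @property
    def request(self):
        # Mirrors the dereference in celery's _get_request: this blows up
        # when request_stack was never initialized.
        return self.request_stack.top


bound = StubTask(bound_to_app=True)
assert bound.request is None  # empty stack, but no crash

unbound = StubTask(bound_to_app=False)
try:
    unbound.request
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'top'
```

This reproduces the shape of the traceback above: the crash happens not in the failing task itself, but in the handler's attempt to read `self.request`.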

If that is indeed the case, perhaps at the same time the inner class at line 373 could be moved higher so it doesn't have to be regenerated on every call (although if you're in this code path, things are kind of bad anyway).
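The hoisting suggestion can be illustrated with a generic sketch. `MyEinfo` appears in the traceback above; everything else here is a simplified hypothetical, not Pulp's actual code.

```python
# Hypothetical sketch of the suggested refactor: a helper class defined
# inside a function is re-created as a fresh class object on every call,
# while a module-level class is created once at import time.

# Before: the class object is rebuilt each time the function runs.
def release_resource_inner():
    class MyEinfo:
        traceback = None
    return MyEinfo

# After: defined once at module scope and simply referenced.
class MyEinfoHoisted:
    traceback = None

def release_resource_hoisted():
    return MyEinfoHoisted

# The inner-class version yields a distinct class object per call;
# the hoisted version always returns the same one.
assert release_resource_inner() is not release_resource_inner()
assert release_resource_hoisted() is release_resource_hoisted()
```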

Actions #1

Updated by fao89 about 3 years ago

  • Sprint/Milestone set to 2.20.0
  • Triaged changed from No to Yes
Actions #2

Updated by dalley about 3 years ago

I'm pretty sure this broke after we moved to Celery 4.0.2: https://pulp.plan.io/issues/2990

In hindsight, my judgement at the time that the issue was unlikely to cause problems was flawed. It should probably be fixed. Would you be willing to look into it, Mihai?

Actions #3

Updated by dkliban@redhat.com about 2 years ago

  • Status changed from NEW to CLOSED - WONTFIX
  • Tags Pulp 2 added

Pulp 2 is in maintenance mode.

Actions #4

Updated by bmbouter about 2 years ago

  • Status changed from CLOSED - WONTFIX to NEW
  • Sprint/Milestone deleted (2.20.0)
Actions #5

Updated by bmbouter about 2 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
  • Triaged changed from Yes to No
Actions #6

Updated by bmbouter about 2 years ago

Untriaging so we can put it on the sprint at triage.

Actions #7

Updated by fao89 almost 2 years ago

  • Triaged changed from No to Yes
  • Sprint set to Sprint 86
Actions #8

Updated by bmbouter almost 2 years ago

  • Status changed from ASSIGNED to POST
Actions #9

Updated by rchan almost 2 years ago

  • Sprint changed from Sprint 86 to Sprint 87
Actions #10

Updated by ipanova@redhat.com almost 2 years ago

  • Platform Release set to 2.21.5

Added by bmbouter almost 2 years ago

Revision 8e69ab19

Adds task failure handling for celery in bad state

Sometimes celery's internal state is so broken that it causes the on_failure handler itself to fail. This is very bad for Pulp because it stops the tasking system from processing tasks, since on_failure no longer releases locks the way it should.

This adds a defensive codepath for that situation which preserves the current behavior but still performs the on_failure cleanup even when Celery is uncooperative.

Additionally, it logs the error in all cases. One aspect of this issue is that, because Celery is in a bad state, it does not correctly log the actual root cause that triggered the on_failure handler.

fixes #5502
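The defensive codepath the commit message describes can be sketched roughly as follows. This is not the actual commit; `safe_on_failure` and `release_locks` are hypothetical names, and the sketch only shows the shape of the fix: log the root cause unconditionally, tolerate a failing handler, and release locks either way.

```python
import logging

log = logging.getLogger(__name__)

def safe_on_failure(task, exc, task_id, release_locks):
    """Hypothetical sketch: attempt the normal on_failure handling, but
    release locks and log the root cause even when celery's internal
    state makes the handler itself blow up."""
    # Log unconditionally, since a broken celery state may otherwise
    # swallow the real root cause of the failure.
    log.error("Task %s failed: %s", task_id, exc)
    try:
        # Normal path: delegate to the task's own failure handler, which
        # may touch celery state (e.g. self.request) and raise.
        task.on_failure(exc, task_id, (), {}, None)
    except Exception:
        log.exception("on_failure handler itself failed for task %s", task_id)
    finally:
        # Cleanup must happen either way, or the tasking system stalls.
        release_locks(task_id)
```

With this shape, a handler that raises (as in the traceback above) no longer prevents the lock release, so the worker can keep processing tasks.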

Actions #11

Updated by bmbouter almost 2 years ago

  • Status changed from POST to MODIFIED

Added by bmbouter almost 2 years ago

Revision ef85981e

Adds task failure handling for celery in bad state

Sometimes celery's internal state is so broken that it causes the on_failure handler itself to fail. This is very bad for Pulp because it stops the tasking system from processing tasks, since on_failure no longer releases locks the way it should.

This adds a defensive codepath for that situation which preserves the current behavior but still performs the on_failure cleanup even when Celery is uncooperative.

Additionally, it logs the error in all cases. One aspect of this issue is that, because Celery is in a bad state, it does not correctly log the actual root cause that triggered the on_failure handler.

fixes #5502

(cherry picked from commit 8e69ab197f68b0b8a701172d8057d360c8a72dc0)

Actions #12

Updated by ggainey almost 2 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE
