Project

Profile

Help

Issue #2990

Celery 4.x traceback upon restarting a worker while processing a task

Added by dalley 3 months ago. Updated 3 months ago.

Status:
NEW
Priority:
Low
Assignee:
-
Sprint/Milestone:
-
Severity:
1. Low
Platform Release:
Blocks Release:
OS:
Backwards Incompatible:
No
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No

Description

When using Celery 4.x with either RabbitMQ or Qpid, a traceback occurs when restarting pulp_workers during the execution of a task.

  • Run a task (e.g. repo sync)
  • In another terminal, run "sudo systemctl restart pulp_workers" while the task is still ongoing
  • Observe the following traceback when the worker restarts
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720) Traceback (most recent call last):
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 374, in trace_task
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)     R = retval = fun(*args, **kwargs)
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)   File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 107, in __call__
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)     return super(PulpTask, self).__call__(*args, **kwargs)
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)   File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 629, in __protected_call__
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)     return self.run(*args, **kwargs)
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)   File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 296, in _release_resource
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)     new_task.on_failure(exception, task_id, (), {}, MyEinfo)
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)   File "/home/vagrant/devel/pulp/server/pulp/server/async/tasks.py", line 595, in on_failure
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)     if not self.request.called_directly:
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)   File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 980, in _get_request
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720)     req = self.request_stack.top
Aug 16 22:52:19 pulp2.dev pulp[3057]: celery.app.trace:ERROR: [934638d3] (3057-94720) AttributeError: 'NoneType' object has no attribute 'top'

This error does not occur when using Celery 3.1.x with RabbitMQ or Qpid, despite the code being essentially identical. The order of some events must have changed such that "self.request_stack" is not being initialized early enough.

https://github.com/celery/celery/blob/master/celery/app/task.py#L980

This is triggered by the line "if not self.request.called_directly" in our own code here: https://github.com/pulp/pulp/blob/master/server/pulp/server/async/tasks.py#L600

"request.called_directly" is a property,, and the failure occurs within the backing code.

History

#1 Updated by ttereshc 3 months ago

  • Priority changed from Normal to Low
  • Severity changed from 2. Medium to 1. Low
  • Triaged changed from No to Yes

@dalley will leave a comment describing impact/consequences of this issue.

#2 Updated by dalley 3 months ago

The impact is that the code below that line in the "on_failure()" callback will fail to run.

 if not self.request.called_directly:
            now = datetime.now(dateutils.utc_tz())
            finish_time = dateutils.format_iso8601_datetime(now)
            task_status = TaskStatus.objects.get(task_id=task_id)
            task_status['state'] = constants.CALL_ERROR_STATE
            task_status['finish_time'] = finish_time
            task_status['traceback'] = einfo.traceback
            if not isinstance(exc, PulpException):
                exc = PulpException(str(exc))
            task_status['error'] = exc.to_dict()
            task_status.save()
            self._handle_cProfile(task_id)
            common_utils.delete_working_directory()

However, this isn't a critical failure because the task gets "canceled" and cleaned up when the worker restarts. Running "pulp-admin tasks list" afterwards displays all the correct information.

Please register to edit this issue

Also available in: Atom PDF