Project

Profile

Help

Issue #1698

closed

TaskStatus start/end times are wrong because we call celery tasks synchronously (as a function) from within another celery task

Added by bmbouter almost 7 years ago. Updated almost 4 years ago.

Status:
CLOSED - WONTFIX
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
1. Low
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

We've had bugs before where task status timestamps and updates were not working in strange ways. At that time we learned that the issue was the TaskStatus was using an unexpected task_id and updating the wrong TaskStatus. The unexpected case occurred when we call a celery task synchronously as a function from within another celery task. I filed this issue at that time in Celery upstream[0], and there isn't a clear resolution. For our purposes, I had intended that we wouldn't perform that call pattern anymore, but I found at least one case in our code that does it. Here is a traceback which shows that:

celery.worker.job:INFO: Task pulp.server.async.tasks._queue_reserved_task[5056dd3d-3657-404a-a058-a97056cc1df9] succeeded in 0.0640279349973s: None
pulp.server.controllers.repository:ERROR: (14502-62656) Error received removing distributor [56c620fb266b9c36e8190bf1] from repo [433cafcf-66e5-4f4c-aa1b-53f739fdd4de]
pulp.server.controllers.repository:ERROR: (14502-62656) Traceback (most recent call last):
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp/server/controllers/repository.py", line 515, in delete
pulp.server.controllers.repository:ERROR: (14502-62656)     dist_controller.delete(distributor.repo_id, distributor.distributor_id)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/celery/local.py", line 167, in <lambda>
pulp.server.controllers.repository:ERROR: (14502-62656)     __call__ = lambda x, *a, **kw: x._get_current_object()(*a, **kw)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp/server/async/tasks.py", line 473, in __call__
pulp.server.controllers.repository:ERROR: (14502-62656)     return super(Task, self).__call__(*args, **kwargs)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp/server/async/tasks.py", line 103, in __call__
pulp.server.controllers.repository:ERROR: (14502-62656)     return super(PulpTask, self).__call__(*args, **kwargs)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/celery/app/trace.py", line 438, in __protected_call__
pulp.server.controllers.repository:ERROR: (14502-62656)     return orig(self, *args, **kwargs)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/celery/app/task.py", line 420, in __call__
pulp.server.controllers.repository:ERROR: (14502-62656)     return self.run(*args, **kwargs)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp/server/controllers/distributor.py", line 133, in delete
pulp.server.controllers.repository:ERROR: (14502-62656)     dist_instance.distributor_removed(repo.to_transfer_repo(), call_config)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp/plugins/file/distributor.py", line 55, in distributor_removed
pulp.server.controllers.repository:ERROR: (14502-62656)     self.unpublish_repo(repo, config)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp_rpm/plugins/distributors/iso_distributor/distributor.py", line 76, in unpublish_repo
pulp.server.controllers.repository:ERROR: (14502-62656)     publish.remove_repository_protection(transfer_repo.repo_obj)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp_rpm/plugins/distributors/iso_distributor/publish.py", line 78, in remove_repository_protection
pulp.server.controllers.repository:ERROR: (14502-62656)     protected_repo_utils.delete_protected_repo(relative_path)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp/repoauth/protected_repo_utils.py", line 64, in delete_protected_repo
pulp.server.controllers.repository:ERROR: (14502-62656)     f.remove_protected_repo_path(repo_relative_path)
pulp.server.controllers.repository:ERROR: (14502-62656)   File "/usr/lib/python2.6/site-packages/pulp/repoauth/protected_repo_utils.py", line 179, in remove_protected_repo_path

[0]: https://github.com/celery/celery/issues/2494


Files

output.txt (1.17 MB) output.txt bmbouter, 03/22/2016 06:51 PM
Actions #1

Updated by bmbouter almost 7 years ago

We are in a good position to add a guard logic to the PulpTask.__call__ method. That method is called for each celery function/object called (both asynchronously and synchronously) so it should be something we can know.

Actions #2

Updated by mhrivnak almost 7 years ago

Can you explain the traceback a bit more? What action caused it? What exception was raised?

It would be helpful also if you just reference the two tasks directly. "Task pulp.foo is calling task pulp.bar synchronously." And if you can also elaborate on the impact, that would be useful. "Task pulp.foo shows an incorrect start|end time in a case where..."

Actions #3

Updated by bmbouter almost 7 years ago

To clarify, this isn't referring to a specific task because it's likely wrong in many places.

The traceback was given to me by another developer and I observed this call pattern which I previously identified as being unsafe. That issue has been resolved, but during investigation I observed this unhealthy call pattern which is indicative of a bug I've solved before.

I just searched for that other bug, but I couldn't not find it. I can say that we likely have bugs where our TaskStatus updates are not occurring the way we think they are due to the task_id being wrong. The idea of the fix is to put a guard in place which raises an exception and let Pulp Smash find them.

Actions #4

Updated by mhrivnak almost 7 years ago

I'm still uncertain of the scope of the problem. In what way will timestamps be wrong? Missing? Off by a few seconds? Off by years? Finished before it started? Or are we not sure?

I appreciate that the problem is inherently somewhat vague at this point, but I think it would be very helpful to stick as closely as you reasonably can to the fundamental bug report form of:

"If you do X, Y, and Z, you expect to see this behavior, but instead see that other behavior."

If the report currently really does have all the information we know, then I suggest converting it to a task to investigate possible bugs in this area.

Actions #5

Updated by bmbouter almost 7 years ago

The timestamps will be wrong in two specific ways. For example terminology, there is an "outer task" which synchronously calls an "inner task". When the "inner task" is called it resets the start_time, overwriting the correct start time. Then, when the inner task completes it marks the issue as complete even though the outer task is still running. That update also sets the finished time stamp too early which eventually does get overwritten when the outer task completes.

Also the issue isn't that it's vague its that it's broad.

Actions #6

Updated by mhrivnak almost 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
  • Priority changed from Normal to High
Actions #7

Updated by bmbouter almost 7 years ago

I determined that there are 4 calls which are called synchronously from an asynchronous task.

<@task: pulp.server.controllers.repository.queue_publish of tasks
<@task: pulp.server.managers.repo.importer.remove_importer of tasks
<@task: pulp.server.tasks.repository.distributor_delete of tasks
<@task: pulp.server.tasks.repository.distributor_update of tasks
Actions #8

Updated by bmbouter almost 7 years ago

The attachment contains tracebacks generated from this branch of code. It logs them as errors. Those callstacks need to be adjusted and then guard code like this can be added to raise a runtime exception when this occurs. https://github.com/pulp/pulp/compare/master...bmbouter:1698-Cyser?expand=1

Actions #9

Updated by bmbouter almost 7 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (bmbouter)
Actions #10

Updated by mhrivnak almost 7 years ago

  • Priority changed from High to Normal
  • Triaged changed from No to Yes
Actions #11

Updated by bmbouter almost 4 years ago

  • Status changed from NEW to CLOSED - WONTFIX
Actions #12

Updated by bmbouter almost 4 years ago

Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.

Actions #13

Updated by bmbouter almost 4 years ago

  • Tags Pulp 2 added

Also available in: Atom PDF