Task #2527
closedEnsure Pulp 2.y works with both python-celery 3.1.z and 4.0.z
100%
Description
Problem¶
python-celery was recently upgrade from 3.1.20-3 to 4.0.2-1 on Fedora rawhide [0]. Celery 4 has some backwards incompatible changes that affect Pulp's usage of celery. This means that until we update Pulp 2.y to work with Celery 4.0.z Pulp will not be able to work with rawhide (and F26). There are at least 2 problems, but there may be more.
- Tasks with reservation won't work because the the dedicated queue name was changed from C.dq (celery 3.1.z) to C.dq2 (celery 4.0). That variable is set here.
- A undocumented API change that dalley discovered and fixed along the way
- The conversion of the two tasks in Comment 11 which need to be converted from @classmethod to @staticmethod to work with both Celery 3.1 and Celery4.
Solution¶
Pulp3 already works with celery 4.0.z, however it is unlikely to make the F26 beta freeze date of 4-28. So in the interim we need to update pulp2 to work with python-celery 4.0.z It also has to maintain compatibility with python-celery 3.1.z which will continue to be used on EL7, and F25-.
Related issues
Updated by bmbouter about 8 years ago
- Related to Task #2452: Upgrade python-celery in Fedora Rawhide to 4.0 added
Updated by bmbouter about 8 years ago
- Description updated (diff)
I rewrote the issue some with details that I know about this change.
Since we still need 2.y to use Celery 3.1.z the spec file change we need is only to remove this line.
Updated by bmbouter about 8 years ago
- Related to Issue #489: When an task worker dies or restarts, work assigned to that worker is cancelled added
Updated by bmbouter almost 8 years ago
Here is a problem statement about why it's unsafe for us to use Celery 4 without a change to how we clean up tasks when workers die. With Celery 4, when a worker dies it's queue is not deleted so any tasks in the queue will stay there until the worker reconnects. When Pulp detects the worker is lost it will issue a Celery cancellation, but only online workers will receive it and when the worker is started later it won't know those tasks have been canceled so they will be run. In other words the following would happen:
1. A worker has sync or publish tasks in its queue
2. That worker dies (say OOM) leaving the tasks in it's dedicated queue
3. Pulp detects the failure and issues cancellation that the worker never receives. The tasks are shown as cancelled to the user because the TaskStatus is updated as such.
4. The worker is restarted and connects to the same dedicated queue
5. The previous tasks (which were canceled) run unexpectedly
Updated by mhrivnak almost 8 years ago
- Priority changed from Normal to High
- Sprint/Milestone set to 33
- Groomed changed from No to Yes
- Sprint Candidate changed from No to Yes
Since this is needed quickly for the Fedora 26 branch date on Feb 28, I am adding it to this sprint and bumping the priority.
Updated by bmbouter almost 8 years ago
- Priority changed from High to Normal
- Sprint/Milestone deleted (
33)
Here is the best approach I have to resolve the bug described in comment #5. It's high level so there are some details missing.
- Assuming there are two workers A and B
- Assume there are many sync,publish, sync, publish, etc tasks in the dedicated queues for workers A and B.
- Assume that there are also many tasks waiting in the resource manager queue
- Assume that worker A is killed with SIGKILL.
If the user is running Celery 3.1.z, the following should happen which is exactly what we do today.
1. After 30 seconds the pulp_celerybeat
process will observe worker A has died
2. pulp_celerybeat
will clean up the worker record in the workers table
3. Since the tasks are lost the TaskStatus records will also be cancelled.
If the user is running Celery 4.0.0+, the following should happen instead of what we do today.
1. After 30 seconds the pulp_celerybeat
process will observe worker A has died
2. pulp_celerybeat
will clean up the worker record in the workers table
3. A new lock will be created (somehow) which will delay the worker from starting up. We already have the worker checking locks during startup so we have an easy opportunity to write a new lock for this purpose. For discussion purposes call this the 'drain worker named X' lock.
4. a pseudocode block like this will be added to the top of the _queue_reserved_task
that the resource manager runs.
For X in each 'drain worker named X' locks:
fetch the next message as (call it next_msg) from the head of the dedicated queue for worker X
dispatch next_msg to an appropriate worker's dedicated queue
ensure to update the reserved resource so that it's not routed back to the same worker
# the queue should be drained by this point
delete the dedicated queue associated with worker X
delete the 'drain worker named X' lock which allows the worker to finish starting whenever it is started next
# continue handling the task the resource manager was already executing by continuing on with the existing code "here":https://github.com/pulp/pulp/blob/6782dfe8a87593dd5579f0bc4a87dea00f4235ea/server/pulp/server/async/tasks.py#L137
Updated by bmbouter almost 8 years ago
- Priority changed from Normal to High
- Sprint/Milestone set to 33
I didn't intend to unset the Sprint or adjust the priority with my last comment.
Updated by dalley almost 8 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dalley
- Priority changed from High to Normal
- Sprint/Milestone deleted (
33)
Updated by dalley almost 8 years ago
- Priority changed from Normal to High
- Sprint/Milestone set to 33
Updated by dalley almost 8 years ago
Another issue that's popping up under Celery 4.0.z:
[vagrant@dev ~]$ systemctl status -n 800 pulp_celerybeat.service
● pulp_celerybeat.service - Pulp's Celerybeat
Loaded: loaded (/home/vagrant/devel/pulp/server/usr/lib/systemd/system/pulp_celerybeat.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2017-02-20 20:38:21 UTC; 9s ago
Process: 8830 ExecStart=/usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler (code=exited, status=1/FAILURE)
Main PID: 8830 (code=exited, status=1/FAILURE)
Feb 20 20:38:20 dev.example.com systemd[1]: Started Pulp's Celerybeat.
Feb 20 20:38:21 dev.example.com celery[8830]: celery beat v4.0.2 (latentcall) is starting.
Feb 20 20:38:21 dev.example.com celery[8830]: Traceback (most recent call last):
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/bin/celery", line 11, in <module>
Feb 20 20:38:21 dev.example.com celery[8830]: sys.exit(main())
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/__main__.py", line 14, in main
Feb 20 20:38:21 dev.example.com celery[8830]: _main()
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 326, in main
Feb 20 20:38:21 dev.example.com celery[8830]: cmd.execute_from_commandline(argv)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
Feb 20 20:38:21 dev.example.com celery[8830]: super(CeleryCommand, self).execute_from_commandline(argv)))
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
Feb 20 20:38:21 dev.example.com celery[8830]: return self.handle_argv(self.prog_name, argv[1:])
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 480, in handle_argv
Feb 20 20:38:21 dev.example.com celery[8830]: return self.execute(command, argv)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 412, in execute
Feb 20 20:38:21 dev.example.com celery[8830]: ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 285, in run_from_argv
Feb 20 20:38:21 dev.example.com celery[8830]: sys.argv if argv is None else argv, command)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 368, in handle_argv
Feb 20 20:38:21 dev.example.com celery[8830]: return self(*args, **options)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 244, in __call__
Feb 20 20:38:21 dev.example.com celery[8830]: ret = self.run(*args, **kwargs)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/bin/beat.py", line 107, in run
Feb 20 20:38:21 dev.example.com celery[8830]: return beat().run()
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/apps/beat.py", line 77, in run
Feb 20 20:38:21 dev.example.com celery[8830]: self.init_loader()
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/apps/beat.py", line 127, in init_loader
Feb 20 20:38:21 dev.example.com celery[8830]: self.app.finalize()
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/app/base.py", line 512, in finalize
Feb 20 20:38:21 dev.example.com celery[8830]: _announce_app_finalized(self)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/_state.py", line 52, in _announce_app_finalized
Feb 20 20:38:21 dev.example.com celery[8830]: callback(app)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/app/base.py", line 412, in cons
Feb 20 20:38:21 dev.example.com celery[8830]: return app._task_from_fun(fun, **opts)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/app/base.py", line 453, in _task_from_fun
Feb 20 20:38:21 dev.example.com celery[8830]: '__header__': staticmethod(head_from_fun(fun, bound=bind)),
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/celery/utils/functional.py", line 275, in head_from_fun
Feb 20 20:38:21 dev.example.com celery[8830]: fun_args=_argsfromspec(getfullargspec(fun)),
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib/python2.7/site-packages/vine/five.py", line 350, in getfullargspec
Feb 20 20:38:21 dev.example.com celery[8830]: s = _getargspec(fun)
Feb 20 20:38:21 dev.example.com celery[8830]: File "/usr/lib64/python2.7/inspect.py", line 815, in getargspec
Feb 20 20:38:21 dev.example.com celery[8830]: raise TypeError('{!r} is not a Python function'.format(func))
Feb 20 20:38:21 dev.example.com celery[8830]: TypeError: <method-wrapper '__call__' of instancemethod object at 0x7f47e125fa00> is not a Python function
Feb 20 20:38:21 dev.example.com systemd[1]: pulp_celerybeat.service: Main process exited, code=exited, status=1/FAILURE
Feb 20 20:38:21 dev.example.com systemd[1]: pulp_celerybeat.service: Unit entered failed state.
Feb 20 20:38:21 dev.example.com systemd[1]: pulp_celerybeat.service: Failed with result 'exit-code'.
As you can see, this traceback is entirely within Celery and doesn't include any calls into our code, so it should be fun to debug. All Celery processes are failing like this.
A similar but not identical traceback can be found here:
Updated by bmbouter almost 8 years ago
After a lot of debugging I can see that celery 4.0.2 is unable to add two specific tasks to the registry. Here is a diff that allows Pulp to start normally when applied on top of @dalley's work in progress PR.
diff --git a/server/pulp/server/managers/repo/unit_association.py b/server/pulp/server/managers/repo/unit_association.py
index ea25b06..06ff730 100644
--- a/server/pulp/server/managers/repo/unit_association.py
+++ b/server/pulp/server/managers/repo/unit_association.py
@@ -433,8 +433,6 @@ class RepoUnitAssociationManager(object):
return bool(existing_count)
-associate_from_repo = task(RepoUnitAssociationManager.associate_from_repo, base=Task)
-unassociate_by_criteria = task(RepoUnitAssociationManager.unassociate_by_criteria, base=Task)
def load_associated_units(source_repo_id, criteria):
Updated by bmbouter almost 8 years ago
- Description updated (diff)
mhrivnak pointed out that the problem statement in Comment 4 won't be an issue. The reason is that Pulp TaskStatus records are checked before each task is run, and if the task being run is in a cancelled or "final" state it is not run. That check is here.
So the existing codepaths already handle this case well. Because of that we are only delivering the following things with this fix:
- The C.dq2 change with Celery4
- A undocumented API change that dalley discovered and fixed along the way
- The conversion of the two tasks in Comment 11 which need to be converted from @classmethod to @staticmethod to work with both Celery 3.1 and Celery4.
I've edited the description some also.
Updated by bmbouter almost 8 years ago
- Blocks Task #2604: Update Pulp to latest GA in Rawhide and include patch for Issue 2527 added
Updated by bmbouter almost 8 years ago
To setup my environment for this I start with a fresh vagrant environment. I then run:
sudo rpm -e --nodeps python-kombu
sudo rpm -e --nodeps python2-celery
sudo rpm -e --nodeps python-amqp
sudo rpm -e --nodeps python-billiard
sudo pip install celery
sudo pip uninstall kombu
sudo rpm -ivh --nodeps https://kojipkgs.fedoraproject.org//packages/python-kombu/4.0.2/4.fc26/noarch/python2-kombu-4.0.2-4.fc26.noarch.rpm
Make sure you have a version of Kombu running that is patched with the fix from https://github.com/celery/kombu/pull/700.
Updated by bmbouter almost 8 years ago
Celery4 also depends on a major release of Kombu4 which is also coming with the current Rawhide (F26)[0]. During testing of the PR with this issue with both Celery4 AND Kombu4 from rawhide, one blocking issue was discovered and reported in upstream Kombu[1]. I a PR which fixed the issue in upstream Kombu[2].
I'm creating a separate issue for me to update the Kombu release in Rawhide to include this patch[2]. That issue will also block #2604.
[0]: https://apps.fedoraproject.org/packages/python-kombu
[1]: https://github.com/celery/kombu/issues/699
[2]: https://github.com/celery/kombu/pull/700
Updated by bmbouter almost 8 years ago
Created the "Patch Kombu in Fedora" issue here[0].
Updated by bmbouter almost 8 years ago
- Related to Task #2609: Patch Kombu in Rawhide to work with Pulp added
Added by dalley almost 8 years ago
Added by dalley almost 8 years ago
Revision fba39f1a | View on GitHub
Update Pulp to support Celery 4.0.z.
Make Pulp compatible with both Celery 3.1.z and Celery 4.0.z
closes #2527
Updated by dalley almost 8 years ago
- Status changed from ASSIGNED to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|fba39f1a82bbf8666df64999e89bc21d6d5ec897.
Updated by bmbouter almost 8 years ago
- Sprint/Milestone changed from 34 to 33
I want to pull this back to Sprint 15 since it just stayed in review for a while before moved to MODIFIED this morning.
Updated by semyers almost 8 years ago
- Tracker changed from Story to Task
- Platform Release set to 2.12.2
Updated by bmbouter almost 8 years ago
- Blocks deleted (Task #2604: Update Pulp to latest GA in Rawhide and include patch for Issue 2527)
Updated by bizhang almost 8 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
Update Pulp to support Celery 4.0.z.
Make Pulp compatible with both Celery 3.1.z and Celery 4.0.z
closes #2527