Task #2527

Ensure Pulp 2.y works with both python-celery 3.1.z and 4.0.z

Added by bizhang almost 4 years ago. Updated over 1 year ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

100%

Estimated time:
Platform Release:
2.12.2
Groomed:
Yes
Sprint Candidate:
Yes
Tags:
Pulp 2
Sprint:
Sprint 15
Quarter:

Description

Problem

python-celery was recently upgraded from 3.1.20-3 to 4.0.2-1 on Fedora Rawhide [0]. Celery 4 has some backwards-incompatible changes that affect Pulp's usage of Celery. This means that until we update Pulp 2.y to work with Celery 4.0.z, Pulp will not work on Rawhide (and F26). There are at least two problems, but there may be more.

  • Tasks with reservation won't work because the dedicated queue name was changed from C.dq (Celery 3.1.z) to C.dq2 (Celery 4.0). That variable is set here.
  • An undocumented API change that dalley discovered and fixed along the way
  • The conversion of the two tasks in Comment 11 from @classmethod to @staticmethod so they work with both Celery 3.1 and Celery 4.
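The first bullet amounts to picking the dedicated-queue name based on the installed Celery major version. A minimal sketch of that idea (the helper `dedicated_queue_suffix` is hypothetical, not Pulp's actual code; in practice the version would come from `celery.VERSION`):

```python
def dedicated_queue_suffix(version):
    """Return the dedicated-queue exchange name for a Celery version tuple.

    Celery 3.1.z uses 'C.dq'; Celery 4.0.z renamed it to 'C.dq2'.
    """
    return 'C.dq' if version < (4, 0) else 'C.dq2'

print(dedicated_queue_suffix((3, 1, 20)))  # C.dq
print(dedicated_queue_suffix((4, 0, 2)))   # C.dq2
```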

Solution

Pulp 3 already works with celery 4.0.z, but it is unlikely to make the F26 beta freeze date of 4-28. So in the interim we need to update Pulp 2 to work with python-celery 4.0.z. It also has to maintain compatibility with python-celery 3.1.z, which will continue to be used on EL7 and F25 and earlier.

[0] https://apps.fedoraproject.org/packages/python-celery


Checklist


Related issues

Related to Pulp - Task #2452: Upgrade python-celery in Fedora Rawhide to 4.0 (CLOSED - CURRENTRELEASE)
Related to Pulp - Issue #489: When an RQ worker dies or restarts, work assigned to that worker is cancelled (NEW)
Related to Pulp - Task #2609: Patch Kombu in Rawhide to work with Pulp (CLOSED - CURRENTRELEASE)

Associated revisions

Revision fba39f1a View on GitHub
Added by dalley almost 4 years ago

Update Pulp to support Celery 4.0.z.

Make Pulp compatible with both Celery 3.1.z and Celery 4.0.z

closes #2527


History

#1 Updated by bmbouter almost 4 years ago

  • Related to Task #2452: Upgrade python-celery in Fedora Rawhide to 4.0 added

#2 Updated by bmbouter almost 4 years ago

  • Checklist item deleted (Update the spec file)
  • Checklist item Remove the python-celery < 3.2.0 from spec file (see comment 1) added
  • Checklist item add release note for this change added
  • Description updated (diff)

I rewrote the issue some with details that I know about this change.

Since we still need 2.y to use Celery 3.1.z the spec file change we need is only to remove this line.

#3 Updated by bmbouter almost 4 years ago

  • Related to Issue #489: When an RQ worker dies or restarts, work assigned to that worker is cancelled added

#4 Updated by bmbouter almost 4 years ago

Here is a problem statement about why it's unsafe for us to use Celery 4 without a change to how we clean up tasks when workers die. With Celery 4, when a worker dies its queue is not deleted, so any tasks in the queue will stay there until the worker reconnects. When Pulp detects the worker is lost it will issue a Celery cancellation, but only online workers will receive it, and when the worker is started later it won't know those tasks have been canceled, so they will run. In other words, the following would happen:

1. A worker has sync or publish tasks in its queue
2. That worker dies (say, from OOM), leaving the tasks in its dedicated queue
3. Pulp detects the failure and issues cancellation that the worker never receives. The tasks are shown as cancelled to the user because the TaskStatus is updated as such.
4. The worker is restarted and connects to the same dedicated queue
5. The previous tasks (which were canceled) run unexpectedly
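One way to close this hole is to re-check the task's stored state immediately before execution, so a task whose TaskStatus was already cancelled is never run. A hedged sketch of such a guard (`FINAL_STATES` and the dict-shaped status are illustrative stand-ins, not Pulp's actual models):

```python
# States from which a task must never be (re)started.
FINAL_STATES = {'canceled', 'finished', 'error'}

def should_run(task_status):
    """Return False when the task was already moved to a final state."""
    return task_status.get('state') not in FINAL_STATES

print(should_run({'state': 'waiting'}))   # True
print(should_run({'state': 'canceled'}))  # False
```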

#5 Updated by mhrivnak almost 4 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone set to 33
  • Groomed changed from No to Yes
  • Sprint Candidate changed from No to Yes

Since this is needed quickly for the Fedora 26 branch date on Feb 28, I am adding it to this sprint and bumping the priority.

#6 Updated by bmbouter almost 4 years ago

  • Priority changed from High to Normal
  • Sprint/Milestone deleted (33)

Here is the best approach I have to resolve the bug described in comment #4. It's high level, so some details are missing.

  • Assume there are two workers, A and B.
  • Assume there are many sync and publish tasks in the dedicated queues for workers A and B.
  • Assume there are also many tasks waiting in the resource manager queue.
  • Assume worker A is killed with SIGKILL.

If the user is running Celery 3.1.z, the following should happen, which is exactly what we do today:

1. After 30 seconds the pulp_celerybeat process will observe worker A has died
2. pulp_celerybeat will clean up the worker record in the workers table
3. Since the tasks are lost, the TaskStatus records will also be cancelled.

If the user is running Celery 4.0.0+, the following should happen instead of what we do today.

1. After 30 seconds the pulp_celerybeat process will observe worker A has died
2. pulp_celerybeat will clean up the worker record in the workers table
3. A new lock will be created (somehow) which will delay the worker from starting up. We already have the worker checking locks during startup so we have an easy opportunity to write a new lock for this purpose. For discussion purposes call this the 'drain worker named X' lock.
4. A pseudocode block like this will be added to the top of _queue_reserved_task, which the resource manager runs.

for X in each 'drain worker named X' lock:
    while the dedicated queue for worker X is not empty:
        fetch the next message (call it next_msg) from the head of the queue
        dispatch next_msg to an appropriate worker's dedicated queue
        ensure the reserved resource is updated so it is not routed back to the same worker
    # the queue should be drained by this point
    delete the dedicated queue associated with worker X
    delete the 'drain worker named X' lock, which allows the worker to finish starting whenever it is started next
# continue handling the task the resource manager was already executing by continuing with the existing code here:
# https://github.com/pulp/pulp/blob/6782dfe8a87593dd5579f0bc4a87dea00f4235ea/server/pulp/server/async/tasks.py#L137
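The pseudocode above could be sketched in Python roughly as follows. Every name here (the lock list and the four callables) is a hypothetical placeholder for whatever broker/lock APIs Pulp would actually use, not a real Pulp interface:

```python
def drain_dead_workers(drain_locks, fetch_next_message, dispatch,
                       delete_queue, release_lock):
    """Drain each dead worker's dedicated queue, then let it start cleanly.

    drain_locks:        iterable of worker names with a 'drain worker' lock
    fetch_next_message: callable(worker) -> next message, or None when empty
    dispatch:           callable(msg) re-routes the task to a live worker
    delete_queue:       callable(worker) removes the dedicated queue
    release_lock:       callable(worker) deletes the drain lock
    """
    for worker_name in drain_locks:
        while True:
            next_msg = fetch_next_message(worker_name)
            if next_msg is None:
                break  # dedicated queue is drained
            # Re-dispatch so the reserved resource is not routed back
            # to the dead worker's queue.
            dispatch(next_msg)
        delete_queue(worker_name)
        # Allows the worker to finish starting whenever it starts next.
        release_lock(worker_name)
```

The resource manager would call this at the top of _queue_reserved_task before continuing with its normal routing logic.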

#7 Updated by bmbouter almost 4 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone set to 33

I didn't intend to unset the Sprint or adjust the priority with my last comment.

#8 Updated by dalley almost 4 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dalley
  • Priority changed from High to Normal
  • Sprint/Milestone deleted (33)

#9 Updated by dalley almost 4 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone set to 33

#10 Updated by dalley almost 4 years ago

Another issue that's popping up under Celery 4.0.z:

[vagrant@dev ~]$ systemctl status -n 800 pulp_celerybeat.service 
● pulp_celerybeat.service - Pulp's Celerybeat
   Loaded: loaded (/home/vagrant/devel/pulp/server/usr/lib/systemd/system/pulp_celerybeat.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2017-02-20 20:38:21 UTC; 9s ago
  Process: 8830 ExecStart=/usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler (code=exited, status=1/FAILURE)
 Main PID: 8830 (code=exited, status=1/FAILURE)

Feb 20 20:38:20 dev.example.com systemd[1]: Started Pulp's Celerybeat.
Feb 20 20:38:21 dev.example.com celery[8830]: celery beat v4.0.2 (latentcall) is starting.
Feb 20 20:38:21 dev.example.com celery[8830]: Traceback (most recent call last):
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/bin/celery", line 11, in <module>
Feb 20 20:38:21 dev.example.com celery[8830]:     sys.exit(main())
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/__main__.py", line 14, in main
Feb 20 20:38:21 dev.example.com celery[8830]:     _main()
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 326, in main
Feb 20 20:38:21 dev.example.com celery[8830]:     cmd.execute_from_commandline(argv)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
Feb 20 20:38:21 dev.example.com celery[8830]:     super(CeleryCommand, self).execute_from_commandline(argv)))
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
Feb 20 20:38:21 dev.example.com celery[8830]:     return self.handle_argv(self.prog_name, argv[1:])
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 480, in handle_argv
Feb 20 20:38:21 dev.example.com celery[8830]:     return self.execute(command, argv)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/celery.py", line 412, in execute
Feb 20 20:38:21 dev.example.com celery[8830]:     ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 285, in run_from_argv
Feb 20 20:38:21 dev.example.com celery[8830]:     sys.argv if argv is None else argv, command)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 368, in handle_argv
Feb 20 20:38:21 dev.example.com celery[8830]:     return self(*args, **options)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/base.py", line 244, in __call__
Feb 20 20:38:21 dev.example.com celery[8830]:     ret = self.run(*args, **kwargs)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/bin/beat.py", line 107, in run
Feb 20 20:38:21 dev.example.com celery[8830]:     return beat().run()
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/apps/beat.py", line 77, in run
Feb 20 20:38:21 dev.example.com celery[8830]:     self.init_loader()
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/apps/beat.py", line 127, in init_loader
Feb 20 20:38:21 dev.example.com celery[8830]:     self.app.finalize()
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/app/base.py", line 512, in finalize
Feb 20 20:38:21 dev.example.com celery[8830]:     _announce_app_finalized(self)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/_state.py", line 52, in _announce_app_finalized
Feb 20 20:38:21 dev.example.com celery[8830]:     callback(app)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/app/base.py", line 412, in cons
Feb 20 20:38:21 dev.example.com celery[8830]:     return app._task_from_fun(fun, **opts)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/app/base.py", line 453, in _task_from_fun
Feb 20 20:38:21 dev.example.com celery[8830]:     '__header__': staticmethod(head_from_fun(fun, bound=bind)),
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/celery/utils/functional.py", line 275, in head_from_fun
Feb 20 20:38:21 dev.example.com celery[8830]:     fun_args=_argsfromspec(getfullargspec(fun)),
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib/python2.7/site-packages/vine/five.py", line 350, in getfullargspec
Feb 20 20:38:21 dev.example.com celery[8830]:     s = _getargspec(fun)
Feb 20 20:38:21 dev.example.com celery[8830]:   File "/usr/lib64/python2.7/inspect.py", line 815, in getargspec
Feb 20 20:38:21 dev.example.com celery[8830]:     raise TypeError('{!r} is not a Python function'.format(func))
Feb 20 20:38:21 dev.example.com celery[8830]: TypeError: <method-wrapper '__call__' of instancemethod object at 0x7f47e125fa00> is not a Python function
Feb 20 20:38:21 dev.example.com systemd[1]: pulp_celerybeat.service: Main process exited, code=exited, status=1/FAILURE
Feb 20 20:38:21 dev.example.com systemd[1]: pulp_celerybeat.service: Unit entered failed state.
Feb 20 20:38:21 dev.example.com systemd[1]: pulp_celerybeat.service: Failed with result 'exit-code'.

As you can see, this traceback is entirely within Celery and doesn't include any calls into our code, so it should be fun to debug. All Celery processes are failing like this.

A similar but not identical traceback can be found here:

https://github.com/celery/celery/issues/3723

#11 Updated by bmbouter almost 4 years ago

After a lot of debugging I can see that celery 4.0.2 is unable to add two specific tasks to the registry. Here is a diff that allows Pulp to start normally when applied on top of @dalley's work-in-progress PR.

diff --git a/server/pulp/server/managers/repo/unit_association.py b/server/pulp/server/managers/repo/unit_association.py
index ea25b06..06ff730 100644
--- a/server/pulp/server/managers/repo/unit_association.py
+++ b/server/pulp/server/managers/repo/unit_association.py
@@ -433,8 +433,6 @@ class RepoUnitAssociationManager(object):
         return bool(existing_count)


-associate_from_repo = task(RepoUnitAssociationManager.associate_from_repo, base=Task)
-unassociate_by_criteria = task(RepoUnitAssociationManager.unassociate_by_criteria, base=Task)


 def load_associated_units(source_repo_id, criteria):

#12 Updated by bmbouter almost 4 years ago

  • Description updated (diff)

mhrivnak pointed out that the problem statement in Comment 4 won't be an issue. The reason is that Pulp TaskStatus records are checked before each task is run, and if the task being run is in a cancelled or "final" state it is not run. That check is here.

So the existing codepaths already handle this case well. Because of that we are only delivering the following things with this fix:

  • The C.dq2 change with Celery 4
  • An undocumented API change that dalley discovered and fixed along the way
  • The conversion of the two tasks in Comment 11 from @classmethod to @staticmethod so they work with both Celery 3.1 and Celery 4.

I've edited the description some also.
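The @classmethod-to-@staticmethod point can be seen with the inspect module. Celery 4 inspects the task callable's signature, and a classmethod accessed on a class is a bound method rather than a plain function, which is what triggered the "is not a Python function" TypeError on Python 2. A small illustration (the `Manager` class is a stand-in, not Pulp's real RepoUnitAssociationManager):

```python
import inspect

class Manager(object):
    """Stand-in for Pulp's RepoUnitAssociationManager (illustrative only)."""

    @classmethod
    def as_classmethod(cls, repo_id):
        return repo_id

    @staticmethod
    def as_staticmethod(repo_id):
        return repo_id

# A classmethod looks like a bound method to signature inspection;
# a staticmethod exposes a plain function both Celery 3.1 and 4 accept.
print(inspect.isfunction(Manager.as_classmethod))   # False
print(inspect.isfunction(Manager.as_staticmethod))  # True
```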

#13 Updated by bmbouter almost 4 years ago

  • Blocks Task #2604: Update Pulp to latest GA in Rawhide and include patch for Issue 2527 added

#14 Updated by bizhang almost 4 years ago

  • Status changed from ASSIGNED to 5

#15 Updated by dalley almost 4 years ago

  • Status changed from 5 to ASSIGNED

#16 Updated by bmbouter almost 4 years ago

To set up my environment for this I start with a fresh vagrant environment. I then run:

sudo rpm -e --nodeps python-kombu
sudo rpm -e --nodeps python2-celery
sudo rpm -e --nodeps python-amqp
sudo rpm -e --nodeps python-billiard
sudo pip install celery
sudo pip uninstall kombu
sudo rpm -ivh --nodeps https://kojipkgs.fedoraproject.org//packages/python-kombu/4.0.2/4.fc26/noarch/python2-kombu-4.0.2-4.fc26.noarch.rpm

Make sure you have a version of Kombu installed that is patched with the fix from https://github.com/celery/kombu/pull/700.

#17 Updated by bmbouter almost 4 years ago

Celery 4 also depends on a major release of Kombu (4.y), which is likewise coming to current Rawhide (F26)[0]. During testing of this issue's PR with both Celery 4 and Kombu 4 from Rawhide, one blocking issue was discovered and reported in upstream Kombu[1]. I submitted a PR which fixed the issue in upstream Kombu[2].

I'm creating a separate issue for me to update the Kombu release in Rawhide to include this patch[2]. That issue will also block #2604.

[0]: https://apps.fedoraproject.org/packages/python-kombu
[1]: https://github.com/celery/kombu/issues/699
[2]: https://github.com/celery/kombu/pull/700

#18 Updated by bmbouter almost 4 years ago

Created the "Patch Kombu in Fedora" issue here[0].

[0]: https://pulp.plan.io/issues/2609

#19 Updated by bmbouter almost 4 years ago

  • Related to Task #2609: Patch Kombu in Rawhide to work with Pulp added

#22 Updated by mhrivnak almost 4 years ago

  • Sprint/Milestone changed from 33 to 34

#23 Updated by dalley almost 4 years ago

  • Status changed from ASSIGNED to MODIFIED
  • % Done changed from 0 to 100

#24 Updated by bmbouter almost 4 years ago

  • Sprint/Milestone changed from 34 to 33

I want to pull this back to Sprint 15 since it stayed in review for a while before it moved to MODIFIED this morning.

#25 Updated by semyers almost 4 years ago

  • Tracker changed from Story to Task
  • Platform Release set to 2.12.2

#26 Updated by bmbouter over 3 years ago

  • Blocks deleted (Task #2604: Update Pulp to latest GA in Rawhide and include patch for Issue 2527)

#27 Updated by semyers over 3 years ago

  • Status changed from MODIFIED to 5

#28 Updated by bizhang over 3 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE

#29 Updated by bmbouter over 2 years ago

  • Sprint set to Sprint 15

#30 Updated by bmbouter over 2 years ago

  • Sprint/Milestone deleted (33)

#31 Updated by bmbouter over 1 year ago

  • Tags Pulp 2 added
