Develop a plan for how Pulp's master branch should determine when a worker was last seen
In Issue #1380 it was determined that Pulp's workers were going missing due to some complicated interactions between Celery, Pulp, and Django. The quick fix made there was to use the Celery event's "local_received" attribute instead of its "timestamp" attribute. The two attributes differ in two ways: the first difference fixes the bug, and the second changes the behavior of Pulp's worker watcher in a way that could be significant:
0) The "timestamp" attribute suffers from this Celery bug and the "local_received" attribute does not.
1) The "timestamp" attribute is created by the sender of the event at creation time. The "local_received" attribute is created by the receiver of the event at receipt time.
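To make the second difference concrete, here is a minimal sketch of how the two fields behave when the sender's clock is off. It assumes both fields are epoch seconds in the event dict, and it uses a hand-built dict rather than a real Celery event; the hostname, the clock values, and the event_age helper are made up for illustration.

```python
import time


def event_age(event, field, now=None):
    """Return how old the event looks, in seconds, judged by one time field."""
    now = time.time() if now is None else now
    return now - event[field]


# Hypothetical heartbeat from a worker whose clock runs 120 s behind
# the clock of the process that receives the event.
receiver_now = 1000000.0
event = {
    "type": "worker-heartbeat",
    "hostname": "worker1@example",       # made-up worker name
    "timestamp": receiver_now - 120.0,   # stamped by the sender at creation
    "local_received": receiver_now,      # stamped by the receiver on receipt
}

# Judged by the sender's clock, the heartbeat already looks two minutes old.
age_by_sender = event_age(event, "timestamp", now=receiver_now)

# Judged by the receiver's clock, it is fresh.
age_by_receiver = event_age(event, "local_received", now=receiver_now)
```

With "timestamp", clock skew between hosts shows up directly in the computed age; with "local_received", only the receiver's clock is involved.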
Pulp is currently using the "local_received" attribute to work around the bug in #1380. The purpose of this task is to have a place to discuss what Pulp should do in the future, as there was some debate about this in #pulp on Freenode. I have described three options below, but please feel free to modify this description if you feel that there are more options that should be considered:
0) Wait for Celery to fix the bug and switch back to using the "timestamp" attribute.
1) Drop all usage of Celery-provided times and calculate the time we received the event ourselves.
2) Do nothing and leave the code as is, using the "local_received" attribute.
Please use the comments to debate the relative pros and cons of the above options.
#2 Updated by rbarlow almost 5 years ago
I would like to make a case for option #1, where Pulp does not rely on Celery's timestamps at all. Celery's timestamp handling is currently shaky due to their use of local time conversions, and I am concerned about what approach they may take to fix the bug on their end. If we ignore their timestamps and simply generate a TZ-aware datetime object when we receive the worker's heartbeat messages, I perceive these benefits:
- We detach ourselves from any backwards-incompatible changes that Celery may introduce in this area in the future.
- By not going back to the "timestamp" attribute when/if Celery fixes the bug, we become less sensitive to clock drift between hosts, which in turn makes Pulp more robust in my opinion. For the same reason, Pulp is also less sensitive to a slow or laggy message broker.
- We are in full control of how the times are created and can therefore ensure that TZ-aware UTC-only datetimes are used throughout our code.
- We will not have to backport the Celery fix to the python-celery package in our repository.
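A minimal sketch of what option #1 could look like, assuming a modern Python with datetime.timezone available. The last_seen dict, handle_heartbeat, and missing_workers are hypothetical stand-ins, not Pulp's actual worker watcher code; the only event field read is "hostname", and both Celery time fields are ignored.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical in-memory stand-in for wherever worker records are stored.
last_seen = {}


def handle_heartbeat(event, now=None):
    """Record when a heartbeat arrived, using only the receiver's clock.

    The event's "timestamp" and "local_received" fields are never read.
    """
    received = now if now is not None else datetime.now(timezone.utc)
    last_seen[event["hostname"]] = received
    return received


def missing_workers(timeout, now=None):
    """Return names of workers not heard from within the timeout."""
    now = now if now is not None else datetime.now(timezone.utc)
    return sorted(name for name, seen in last_seen.items() if now - seen > timeout)


# Usage with fixed times so the outcome does not depend on the wall clock.
t0 = datetime(2015, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
handle_heartbeat({"hostname": "worker1@example"}, now=t0)
handle_heartbeat({"hostname": "worker2@example"}, now=t0 + timedelta(seconds=50))
gone = missing_workers(timedelta(seconds=60), now=t0 + timedelta(seconds=70))
```

Because every stored value comes from datetime.now(timezone.utc) in one place, all comparisons are between TZ-aware UTC datetimes.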
The counterargument raised in #pulp to my second point is that if a worker goes missing while the message broker is laggy or slow, Pulp will take longer than usual to detect that the worker has disappeared, and will keep assigning work to it in the meantime. In my opinion, the robustness gained from this proposal outweighs this objection, but I wanted to document it as well for consideration. Please feel free to weigh in on this point.
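The size of that delay can be sketched with simple arithmetic. This assumes the sender's and receiver's clocks agree, that the worker dies right after sending its last heartbeat, and that the watcher declares a worker missing once its last-seen time is older than a fixed timeout; earliest_detection and the numbers used are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def earliest_detection(sent_at, broker_lag, timeout, stamp_on_receipt):
    """Earliest moment a dead worker's last heartbeat counts as expired."""
    # Stamping on receipt shifts the reference time by the broker lag.
    reference = sent_at + broker_lag if stamp_on_receipt else sent_at
    return reference + timeout


sent = datetime(2015, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
lag = timedelta(seconds=20)       # hypothetical broker delay
timeout = timedelta(seconds=60)   # hypothetical missing-worker timeout

by_sender_time = earliest_detection(sent, lag, timeout, stamp_on_receipt=False)
by_receipt_time = earliest_detection(sent, lag, timeout, stamp_on_receipt=True)
extra_delay = by_receipt_time - by_sender_time
```

Under these assumptions the extra detection delay equals the broker lag, so it only matters when the broker is slow relative to the timeout.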
#7 Updated by bmbouter over 1 year ago
Pulp 2 is approaching maintenance mode, and this Pulp 2 ticket is not being actively worked on. As such, it is being closed as WONTFIX. Pulp 2 is still accepting contributions though, so if you want to contribute a fix for this ticket, please reopen or comment on it. If you don't have permissions to reopen this ticket, or you want to discuss an issue, please reach out via the developer mailing list.