Task #3121
closedSimplify tracking of worker 'online' state
100%
Description
The "online" state of worker is currently dependent on two factors. The 'online' field, present on the Worker model, and also whether the last recorded heartbeat is within the timeout interval (30 seconds) of the present time. These two conditions are set in multiple different places, and the online status of the worker based upon those two values is evaluated in multiple different places.
I propose a few different changes to DRY this up.
The canonical reference for whether a worker is online (based on the multiple different criteria) should be a property on the worker model named "online". The field currently named "online" should be renamed to "gracefully_stopped". The "online" property should then check both the current-ness of the timestamp as well as the "gracefully_stopped" field The worker serializer should be updated to include both of these. Currently there exist several parts of the code which set or reference the "online" field, and these should be removed.
The "gracefully_stopped" field should be True only when the worker is shutdown via normal procedures. During operation, and after a hard-kill (e.g. OOM), this value should be false. The default value of this field should be "True", and it should be updated to "False" whenever "save_heartbeat()" is called for that worker, so that the value will be set to "False" during normal operation. It should be set to True as part of the shutdown sequence for the worker.
Related issues
Updated by bmbouter about 7 years ago
This story is well written, and the 'is_online' property changes all make perfect sense.
In terms of keeping the 'online' boolean itself that is really just for the graceful shutdown case right? If we got rid of it entirely the only issue users would have is that when shutting down a worker gracefully is_online would show True for an additional 30 seconds right?
Updated by dalley about 7 years ago
Actually, there is another scenario. When a worker first goes offline we probably want to log some message about it. If this happens in a non-gentle way, we need some way to distinguish the workers which are supposed to be online (but not heartbeating) apart from the others so that we can print a message once, and only once, since we are keeping the old worker records around instead of deleting them. This would be accomplished by setting online=False after logging the message.
Updated by bmbouter about 7 years ago
Having the 'is_online' and 'online' will be confusing for developers. What about renaming 'online' to be 'gracefully_shutdown' which we could record to True during graceful shutdown and set to False in all other places? It would also be included in the viewset and would be user read-only. I like that better because it creates more value (knowledge) for the user and solves the user case.
What do you think about ^?
Updated by dkliban@redhat.com about 7 years ago
bmbouter wrote:
Having the 'is_online' and 'online' will be confusing for developers. What about renaming 'online' to be 'gracefully_shutdown' which we could record to True during graceful shutdown and set to False in all other places? It would also be included in the viewset and would be user read-only. I like that better because it creates more value (knowledge) for the user and solves the user case.
What do you think about ^?
I would prefer 'online' and 'gracefully_stopped'.
Updated by bmbouter about 7 years ago
- Description updated (diff)
- Groomed changed from No to Yes
- Sprint Candidate changed from No to Yes
I'm +1 on 'gracefully_stopped' also because it more clearly past tense and seems like simpler language.
I think this story is good to go, so I'm grooming.
Updated by dalley about 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dalley
Updated by daviddavis almost 7 years ago
- Related to Story #3143: As an authenticated user, I can filter workers. added
Updated by dalley almost 7 years ago
- Status changed from ASSIGNED to NEW
I still plan to do this but it's been moved to the very end of my list of priorities by other work.
Updated by jortel@redhat.com almost 7 years ago
- Sprint/Milestone changed from 53 to 54
Updated by dalley over 6 years ago
- Assignee deleted (
dalley) - Sprint/Milestone deleted (
56)
Updated by daviddavis over 6 years ago
Added a checklist item to create a testcase in pulp-smash ;)
Updated by jortel@redhat.com over 6 years ago
- Sprint changed from Sprint 33 to Sprint 34
Updated by dalley over 6 years ago
- Status changed from ASSIGNED to POST
Added by dalley over 6 years ago
Added by dalley over 6 years ago
Revision f852da10 | View on GitHub
Simplify worker 'online' state tracking
Added by dalley over 6 years ago
Revision 501cdd8a | View on GitHub
Additional cleanup and refactoring
Added by dalley over 6 years ago
Revision 501cdd8a | View on GitHub
Additional cleanup and refactoring
Updated by dalley over 6 years ago
- Status changed from POST to MODIFIED
- % Done changed from 0 to 100
Applied in changeset pulp|f852da1053f99e32cb7a543387f41bc998a77227.
Updated by bmbouter almost 5 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE
Simplify worker 'online' state tracking
closes #3121 https://pulp.plan.io/issues/3121