Issue #2496 (CLOSED)
Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API to still show them as running
Description
To reproduce:
1. start with a running healthy pulp system
2. kill pulp_celerybeat, pulp_resource_manager, and pulp_workers in that order
[vagrant@dev ~]$ sudo pkill -9 -f 'celery beat'
[vagrant@dev ~]$ sudo pkill -9 -f 'celery worker'
3. Verify they are stopped
[vagrant@dev ~]$ ps -awfux | grep celery
vagrant 1154 0.0 0.0 12736 996 pts/0 S+ 22:18 0:00 \_ grep --color=auto celery
4. Look at the status API and see that processes are reported as still running
[vagrant@dev ~]$ pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: scheduler@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:22:28Z
_id: resource_manager@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-1@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-2@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-0@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-3@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.10.3b2
Related issues
Related to Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running
Blocked by Story #2519: Enable workers to record their own heartbeat records to the database
Updated by bmbouter over 6 years ago
I think the main deficiency here is that, in general, everything that reads the Worker records considers a worker alive merely because its record is present. The timestamps are ignored almost everywhere. Currently the only thing that actually checks timestamps is the worker watcher thread in pulp_celerybeat, so if you kill that first the records never get cleaned up.
Recently, we've planned to teach pulp-manage-db to look at the timestamps and evaluate for itself whether there are still workers running. This is a neat trick because pulp-manage-db is then not dependent on the records being maintained by pulp_celerybeat. The same approach could be applied in the limited places that read workers, so that even if a record is still present, once its timestamp ages out it will be ignored (see the sketch after this list). Specifically we could apply this to:
- The resource manager task which looks for workers
- The /status/ api view
I think ^ are the only places.
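A rough sketch of the idea, assuming the worker documents live in a `workers` collection with a `last_heartbeat` field like the one shown in the status output above (the helper name, the pymongo collection handle, and the cutoff value are illustrative, not the actual Pulp API):

```python
from datetime import datetime, timedelta

# Illustrative cutoff; the thread later settles on 25 seconds.
HEARTBEAT_CUTOFF = timedelta(seconds=25)


def live_workers(worker_collection):
    """Return only worker documents with a recent heartbeat, regardless of
    whether pulp_celerybeat ever got a chance to delete the stale records."""
    oldest_allowed = datetime.utcnow() - HEARTBEAT_CUTOFF
    return worker_collection.find({'last_heartbeat': {'$gte': oldest_allowed}})
```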
This leaves one lingering problem: pulp_celerybeat is also the thing that writes the records. I think it would be better if all workers wrote their own records to the db instead of sending them through the message bus. Without this, pulp-manage-db could mistakenly stop waiting and proceed if pulp_celerybeat was killed, even though workers are still running, because their timestamps would no longer be updated even though they are still heartbeating. I could even file this as a separate bug if someone encourages me to; it technically is a separate problem. A minimal sketch of workers writing their own heartbeats is below.
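A minimal sketch of that idea, with each worker process refreshing its own record directly in the db (the function names, interval, and pymongo handle are assumptions for illustration, not the actual Pulp implementation):

```python
import threading
import time
from datetime import datetime

HEARTBEAT_INTERVAL = 5  # seconds; illustrative value


def start_heartbeat_thread(worker_collection, worker_name):
    """Upsert this worker's own record on a timer, instead of sending
    heartbeats through the message bus for pulp_celerybeat to record."""
    def _beat():
        while True:
            worker_collection.update_one(
                {'_id': worker_name},
                {'$set': {'last_heartbeat': datetime.utcnow()}},
                upsert=True)
            time.sleep(HEARTBEAT_INTERVAL)

    thread = threading.Thread(target=_beat)
    thread.daemon = True
    thread.start()
```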
Updated by bmbouter over 6 years ago
- Related to Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running added
Updated by mhrivnak over 6 years ago
Since we want to do the same query in multiple places, it's worth considering a custom ```QuerySetManager``` on the Worker model that applies the right filters. That would let us define "the rules" for identifying live workers in one place, near the model itself.
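Something along these lines, as a sketch (MongoEngine-style; the class names, field names, and cutoff constant are assumptions for illustration, not the actual Pulp model):

```python
from datetime import datetime, timedelta

from mongoengine import DateTimeField, Document, QuerySet, StringField

WORKER_TIMEOUT_SECONDS = 25  # cutoff discussed later in this issue


class WorkerQuerySet(QuerySet):
    """Keeps 'the rules' for identifying live workers in one place."""

    def get_online(self):
        cutoff = datetime.utcnow() - timedelta(seconds=WORKER_TIMEOUT_SECONDS)
        return self.filter(last_heartbeat__gte=cutoff)


class Worker(Document):
    name = StringField(primary_key=True)
    last_heartbeat = DateTimeField()

    meta = {'collection': 'workers', 'queryset_class': WorkerQuerySet}
```

Every caller would then use `Worker.objects.get_online()` instead of re-implementing the timestamp check.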
Updated by bmbouter over 6 years ago
mhrivnak: that is a great idea; let's do that.
Updated by jortel@redhat.com about 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to jortel@redhat.com
Updated by jortel@redhat.com about 6 years ago
- Status changed from ASSIGNED to NEW
- Assignee deleted (jortel@redhat.com)
Updated by dkliban@redhat.com about 6 years ago
I added a new issue, https://pulp.plan.io/issues/2519, to address the last part of https://pulp.plan.io/issues/2496#note-1.
Updated by dkliban@redhat.com about 6 years ago
- Related to Story #2519: Enable workers to record their own heartbeat records to the database added
Updated by daviddavis about 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
Updated by daviddavis about 6 years ago
What sort of time limit do we want to use here (i.e., after how many seconds can we assume the process has died)?
Updated by bmbouter about 6 years ago
As of 2.12, any worker whose check-in timestamp is older than 25 seconds should be considered missing.
Additionally, I think this is blocked on 2519. I should have added that blocker earlier.
Updated by bmbouter about 6 years ago
- Related to deleted (Story #2519: Enable workers to record their own heartbeat records to the database)
Updated by bmbouter about 6 years ago
- Blocked by Story #2519: Enable workers to record their own heartbeat records to the database added
Updated by daviddavis about 6 years ago
- Status changed from ASSIGNED to NEW
- Assignee deleted (daviddavis)
Unassigning myself since this is not ready.
Updated by dkliban@redhat.com about 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
Updated by dkliban@redhat.com about 6 years ago
- Status changed from ASSIGNED to POST
Added by dkliban@redhat.com about 6 years ago
Problem: Stale worker documents present in the db
Solution: A custom queryset for the Worker model allows filtering out worker records which have not been updated in more than 25 seconds. This queryset is used in two places:
- Status API
- Resource manager code that looks for available workers
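Roughly how the two call sites might consume such a queryset (a sketch assuming the `get_online()` manager from the earlier comment; not the literal Pulp code):

```python
# Assumes the Worker model / WorkerQuerySet sketched in the earlier comment.

def status_view_workers():
    # Status API: report only workers whose heartbeat is fresh.
    return [{'_id': w.name, 'last_heartbeat': w.last_heartbeat}
            for w in Worker.objects.get_online()]


def pick_worker_for_task():
    # Resource manager: dispatch reserved work only to live workers.
    return Worker.objects.get_online().first()
```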
Updated by dkliban@redhat.com about 6 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|d92a1deada56788bf862b26e61dc9b0a9027e2ff.
Updated by pthomas@redhat.com almost 6 years ago
verified
[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery beat'
Killed
[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery worker'
Killed
[root@ibm-x3550m3-09 ~]# ps -awfux | grep celery
root 6039 0.0 0.0 112648 960 pts/0 S+ 14:16 0:00 \_ grep --color=auto celery
[root@ibm-x3550m3-09 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: reserved_resource_worker-0@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-4@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-3@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-6@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-2@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-5@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-17@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-7@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-14@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-8@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-11@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-18@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-12@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-16@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-15@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-20@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-21@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-19@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-13@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-22@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-23@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: resource_manager@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:42Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.13b1
[root@ibm-x3550m3-09 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.13b1
[root@ibm-x3550m3-09 ~]#
Updated by pcreech almost 6 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
Updated by bmbouter about 5 years ago
- Sprint changed from Sprint 16 to Sprint 14