Issue #2496
closed
Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API to still show them as running
Status:
CLOSED - CURRENTRELEASE
Description
To reproduce:
1. start with a running healthy pulp system
2. kill pulp_celerybeat, pulp_resource_manager, and pulp_workers in that order
[vagrant@dev ~]$ sudo pkill -9 -f 'celery beat'
[vagrant@dev ~]$ sudo pkill -9 -f 'celery worker'
3. Verify they are stopped
[vagrant@dev ~]$ ps -awfux | grep celery
vagrant 1154 0.0 0.0 12736 996 pts/0 S+ 22:18 0:00 \_ grep --color=auto celery
4. Look at the status API and see that processes are reported as still running
[vagrant@dev ~]$ pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: scheduler@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:22:28Z
_id: resource_manager@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-1@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-2@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-0@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-3@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.10.3b2
I think the main deficiency here is that nearly everything that reads the Worker records considers a process alive based only on whether its record is present. The timestamps are ignored almost everywhere. Currently the only thing that actually checks timestamps is the worker watcher thread in pulp_celerybeat, so if you kill that process first, the records never get cleaned up.
Recently, we've planned to teach pulp-manage-db to look at the timestamps and evaluate for itself whether there are still workers running. This is a neat trick because pulp-manage-db is then not dependent on the records being maintained by pulp_celerybeat. The same approach could be applied in the few places that read Worker records, so that even if a record is still present, it is ignored once its timestamp ages out. Specifically, we could apply this to:
- The resource manager task which looks for workers
- The /status/ api view
I think ^ are the only places.
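The timestamp check described above could be sketched roughly as follows. This is a minimal stand-in, not Pulp's actual code: the function name, the dict-shaped records, and the `max_age_seconds` parameter are all illustrative; in Pulp this would be a query against the `workers` collection.

```python
from datetime import datetime, timedelta


def live_workers(records, max_age_seconds, now=None):
    """Ignore Worker records whose last heartbeat is older than max_age_seconds.

    `records` is any iterable of dicts shaped like the documents in the
    workers collection (an illustrative stand-in for a real queryset filter).
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(seconds=max_age_seconds)
    return [r for r in records if r['last_heartbeat'] >= cutoff]
```

Both call sites listed above would then go through this one check instead of trusting mere record presence.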
This leaves only one lingering problem: pulp_celerybeat also writes the records. I think it would be better if all workers wrote their own records to the db instead of sending them through the message bus. Without that change, pulp-manage-db's wait for workers could end mistakenly if pulp_celerybeat was killed even though workers are still running, because their timestamps would stop being updated even though they are still heartbeating. I could even file this as a separate bug if someone encourages me to; it technically is a separate problem.
- Related to Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running added
Since we want to do the same query in multiple places, it's worth considering a custom `QuerySetManager` on the Worker model that applies the right filters. That would let us define "the rules" for identifying live workers in one place, near the model itself.
mhrivnak: that is a great idea; let's do that.
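The idea of hanging "the rules" off the model could look roughly like this stand-in. In Pulp this would be a custom MongoEngine queryset attached to the Worker Document; the class names, field names, and the 25-second default here are assumptions for illustration only.

```python
from datetime import datetime, timedelta


class WorkerQuerySet:
    """Stand-in for a custom queryset that owns the liveness rules."""

    def __init__(self, records):
        self._records = records

    def online(self, max_age_seconds=25, now=None):
        """Keep only workers whose heartbeat is fresher than max_age_seconds."""
        now = now or datetime.utcnow()
        cutoff = now - timedelta(seconds=max_age_seconds)
        return WorkerQuerySet(
            [r for r in self._records if r.last_heartbeat >= cutoff])

    def __iter__(self):
        return iter(self._records)


class Worker:
    """Minimal model stand-in; the real thing would be a mongoengine Document."""

    def __init__(self, name, last_heartbeat):
        self.name = name
        self.last_heartbeat = last_heartbeat
```

With a real custom queryset wired into the model, every caller would write something like `Worker.objects.online()`, so the definition of "online" lives in exactly one place.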
- Sprint/Milestone set to 31
- Triaged changed from No to Yes
- Status changed from NEW to ASSIGNED
- Assignee set to jortel@redhat.com
- Status changed from ASSIGNED to NEW
- Assignee deleted (jortel@redhat.com)
- Related to Story #2519: Enable workers to record their own heartbeat records to the database added
- Sprint/Milestone changed from 31 to 32
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
What sort of time limit do we want to use here (i.e., after how many seconds can we assume the process died)?
As of 2.12, any worker whose check-in timestamp is older than 25 seconds should be considered missing.
Additionally, I think this is blocked on 2519. I should have put that on earlier.
- Related to deleted (Story #2519: Enable workers to record their own heartbeat records to the database)
- Blocked by Story #2519: Enable workers to record their own heartbeat records to the database added
- Status changed from ASSIGNED to NEW
- Assignee deleted (daviddavis)
Unassigning myself since this is not ready.
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
- Status changed from ASSIGNED to POST
- Status changed from POST to MODIFIED
- Platform Release set to 2.13.0
- Status changed from MODIFIED to 5
verified
[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery beat'
Killed
[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery worker'
Killed
[root@ibm-x3550m3-09 ~]# ps -awfux | grep celery
root 6039 0.0 0.0 112648 960 pts/0 S+ 14:16 0:00 \_ grep --color=auto celery
[root@ibm-x3550m3-09 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: reserved_resource_worker-0@ibm-x3550m3-09.lab.eng.brq.redhat.c
om
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-4@ibm-x3550m3-09.lab.eng.brq.redhat.c
om
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-3@ibm-x3550m3-09.lab.eng.brq.redhat.c
om
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-6@ibm-x3550m3-09.lab.eng.brq.redhat.c
om
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-2@ibm-x3550m3-09.lab.eng.brq.redhat.c
om
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-5@ibm-x3550m3-09.lab.eng.brq.redhat.c
om
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-17@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-7@ibm-x3550m3-09.lab.eng.brq.redhat.c
om
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-14@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-8@ibm-x3550m3-09.lab.eng.brq.redhat.c
om
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-11@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-18@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-12@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-16@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-15@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-20@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-21@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-19@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-13@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-22@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-23@ibm-x3550m3-09.lab.eng.brq.redhat.
com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: resource_manager@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:42Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.13b1
[root@ibm-x3550m3-09 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.13b1
[root@ibm-x3550m3-09 ~]#
- Status changed from 5 to CLOSED - CURRENTRELEASE
- Sprint changed from Sprint 16 to Sprint 14
- Sprint/Milestone deleted (32)
Problem: Stale worker documents present in the db
Solution: A custom queryset for the Worker model allows filtering out Worker records that have not been updated in more than 25 seconds. This queryset is used in two places:
- the status API
- the resource manager code that looks for available workers
closes #2496 https://pulp.plan.io/issues/2496