Issue #2496

closed

After killing pulp_workers, pulp_celerybeat, and pulp_resource_manager, the status API still shows them as running

Added by bmbouter over 7 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
2. Medium
Version:
Platform Release:
2.13.0
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 14
Quarter:

Description

To reproduce:

1. Start with a running, healthy Pulp system
2. Kill pulp_celerybeat, pulp_resource_manager, and pulp_workers, in that order

[vagrant@dev ~]$ sudo pkill -9 -f 'celery beat'
[vagrant@dev ~]$ sudo pkill -9 -f 'celery worker'

3. Verify they are stopped

[vagrant@dev ~]$ ps -awfux | grep celery
vagrant   1154  0.0  0.0  12736   996 pts/0    S+   22:18   0:00              \_ grep --color=auto celery

4. Look at the status API and see that processes are reported as still running

[vagrant@dev ~]$ pulp-admin status
+----------------------------------------------------------------------+
                          Status of the server
+----------------------------------------------------------------------+

Api Version:           2
Database Connection:   
  Connected: True
Known Workers:         
  _id:            scheduler@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:22:28Z
  _id:            resource_manager@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
  _id:            reserved_resource_worker-1@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
  _id:            reserved_resource_worker-2@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
  _id:            reserved_resource_worker-0@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
  _id:            reserved_resource_worker-3@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
Messaging Connection:  
  Connected: True
Versions:              
  Platform Version: 2.10.3b2

Related issues

Related to Pulp - Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running (CLOSED - CURRENTRELEASE, daviddavis)
Blocked by Pulp - Story #2519: Enable workers to record their own heartbeat records to the database (CLOSED - CURRENTRELEASE, dalley)

Actions #1

Updated by bmbouter over 7 years ago

I think the main deficiency here is that nearly everything that reads the Worker records considers a worker alive as long as its record is present. The timestamps are ignored almost everywhere. Currently the only thing that actually checks timestamps is the worker watcher thread in pulp_celerybeat, so if you kill that first, the records never get cleaned up.

Recently, we've planned to teach pulp-manage-db to look at the timestamps to evaluate for itself whether there are still workers running. This is a neat trick because pulp-manage-db then no longer depends on the records being maintained by pulp_celerybeat. The same approach could be applied in the limited places that read workers so that, even if a record is still present, it is ignored once its timestamp ages out. Specifically we could apply this to:

  • The resource manager task which looks for workers
  • The /status/ api view

I think ^ are the only places.
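
A minimal sketch of the timestamp-aging idea described above, assuming each Worker record can be read as a dict with a `last_heartbeat` datetime. The function name `live_workers`, the record shape, and the 30-second `max_age` are illustrative assumptions, not Pulp's actual code or settings.

```python
from datetime import datetime, timedelta


def live_workers(records, now, max_age):
    """Return only the records whose last heartbeat is recent enough.

    `records` is an iterable of dicts holding a 'last_heartbeat' datetime;
    this is a hypothetical stand-in for the Worker documents.
    """
    cutoff = now - max_age
    return [r for r in records if r["last_heartbeat"] >= cutoff]


# Timestamps borrowed from the status output in the description.
records = [
    {"_id": "resource_manager@dev", "last_heartbeat": datetime(2016, 12, 16, 22, 23, 3)},
    {"_id": "scheduler@dev", "last_heartbeat": datetime(2016, 12, 16, 22, 22, 28)},
]
now = datetime(2016, 12, 16, 22, 23, 10)

# The record is still present for scheduler@dev, but it is too old to count.
fresh = live_workers(records, now, max_age=timedelta(seconds=30))
print([r["_id"] for r in fresh])
```

With this kind of check, a reader of the records no longer depends on anything else having deleted the stale ones first.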

This leaves only one lingering problem, which is that pulp_celerybeat also writes the records. I think it would be better if all workers wrote their own records to the db instead of sending them through the message bus. Without this, the pulp-manage-db waiting for workers could continue mistakenly if pulp-manage-db was killed even though workers are still running. This is because their timestamps would not be updated even though they are still heartbeating. I could even file this as a separate bug if someone encourages me to. Technically it is a separate problem.
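
As a rough illustration of workers writing their own records, the sketch below has each worker upsert its own heartbeat directly into the store. A plain dict stands in for the workers collection, and `record_heartbeat` is a hypothetical helper, not an existing Pulp function.

```python
from datetime import datetime, timezone


def record_heartbeat(store, worker_name, now=None):
    """Upsert this worker's own record with a fresh timestamp.

    `store` is a plain dict standing in for the workers collection
    (an assumption for illustration only).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    store[worker_name] = {"_id": worker_name, "last_heartbeat": now}
    return store[worker_name]


store = {}
name = "reserved_resource_worker-0@dev"

# Two consecutive heartbeats from the same worker update one record in place.
record_heartbeat(store, name, datetime(2016, 12, 16, 22, 23, 3, tzinfo=timezone.utc))
record_heartbeat(store, name, datetime(2016, 12, 16, 22, 23, 8, tzinfo=timezone.utc))
print(store[name]["last_heartbeat"].isoformat())
```

Because the write happens in the worker itself, the timestamp keeps advancing for as long as that worker is alive, regardless of whether any other process is running.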

Actions #2

Updated by bmbouter over 7 years ago

  • Related to Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running added
Actions #4

Updated by mhrivnak over 7 years ago

Since we want to do the same query in multiple places, it's worth considering a custom `QuerySetManager` on the Worker model that applies the right filters. That would let us define "the rules" for identifying live workers in one place, near the model itself.

Actions #5

Updated by bmbouter over 7 years ago

mhrivnak: that is a great idea; let's do that.

Actions #6

Updated by bizhang over 7 years ago

  • Sprint/Milestone set to 31
Actions #7

Updated by bizhang over 7 years ago

  • Triaged changed from No to Yes
Actions #8

Updated by jortel@redhat.com over 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to jortel@redhat.com
Actions #9

Updated by jortel@redhat.com over 7 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (jortel@redhat.com)
Actions #11

Updated by dkliban@redhat.com over 7 years ago

  • Related to Story #2519: Enable workers to record their own heartbeat records to the database added
Actions #12

Updated by mhrivnak over 7 years ago

  • Sprint/Milestone changed from 31 to 32
Actions #13

Updated by daviddavis over 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis
Actions #14

Updated by daviddavis over 7 years ago

What sort of time limit do we want to use here (i.e. after how many seconds can we assume the process died)?

Actions #15

Updated by bmbouter over 7 years ago

As of 2.12, any worker whose check-in timestamp is older than 25 seconds should be considered missing.

Additionally, I think this is blocked on 2519. I should have put that on earlier.

Actions #16

Updated by bmbouter over 7 years ago

  • Related to deleted (Story #2519: Enable workers to record their own heartbeat records to the database)
Actions #17

Updated by bmbouter over 7 years ago

  • Blocked by Story #2519: Enable workers to record their own heartbeat records to the database added
Actions #18

Updated by daviddavis over 7 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (daviddavis)

Unassigning myself since this is not ready.

Actions #19

Updated by dkliban@redhat.com about 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com
Actions #20

Updated by dkliban@redhat.com about 7 years ago

  • Status changed from ASSIGNED to POST

Added by dkliban@redhat.com about 7 years ago

Revision d92a1dea | View on GitHub

Problem: Stale worker documents present in the db

Solution: A custom queryset for the Worker model allows filtering out worker records which have not been updated in more than 25 seconds. This queryset is used in two places:

  • Status API

  • Resource manager code that looks for available workers

closes #2496 https://pulp.plan.io/issues/2496
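
The shape of that change can be sketched as follows: one queryset-style class owns the 25-second rule, and both callers go through it. This is a list-backed stand-in written for illustration; `WorkerQuerySet`, `get_online`, `status_view`, and `pick_worker` are assumed names and do not reproduce the code in the revision.

```python
from datetime import datetime, timedelta

# 25 seconds is the threshold stated in the commit message.
WORKER_TIMEOUT = timedelta(seconds=25)


class WorkerQuerySet(list):
    """List-backed stand-in for a model queryset (not the real Pulp class)."""

    def get_online(self, now):
        """Keep only workers updated within the last WORKER_TIMEOUT."""
        cutoff = now - WORKER_TIMEOUT
        return WorkerQuerySet(w for w in self if w["last_heartbeat"] >= cutoff)


def status_view(workers, now):
    """Hypothetical status caller: report only live workers."""
    return {"known_workers": [w["_id"] for w in workers.get_online(now)]}


def pick_worker(workers, now):
    """Hypothetical resource-manager caller: choose among live workers."""
    online = workers.get_online(now)
    return online[0]["_id"] if online else None


workers = WorkerQuerySet([
    {"_id": "reserved_resource_worker-0@dev", "last_heartbeat": datetime(2017, 4, 19, 12, 16, 40)},
    {"_id": "resource_manager@dev", "last_heartbeat": datetime(2017, 4, 19, 12, 16, 42)},
])

# Ten seconds after the last heartbeats both records count as online;
# thirty seconds later neither does, even though the records still exist.
print(status_view(workers, datetime(2017, 4, 19, 12, 16, 50)))
print(status_view(workers, datetime(2017, 4, 19, 12, 17, 12)))
print(pick_worker(workers, datetime(2017, 4, 19, 12, 17, 12)))
```

Keeping the cutoff inside the queryset means the status API and the resource manager cannot drift apart on what "online" means.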


Actions #22

Updated by dkliban@redhat.com about 7 years ago

  • Status changed from POST to MODIFIED
Actions #23

Updated by semyers about 7 years ago

  • Platform Release set to 2.13.0
Actions #24

Updated by pcreech about 7 years ago

  • Status changed from MODIFIED to 5
Actions #25

Updated by pthomas@redhat.com almost 7 years ago

verified

[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery beat'
Killed
[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery worker'
Killed
[root@ibm-x3550m3-09 ~]#  ps -awfux | grep celery
root      6039  0.0  0.0 112648   960 pts/0    S+   14:16   0:00          \_ grep --color=auto celery
[root@ibm-x3550m3-09 ~]#  pulp-admin status
+----------------------------------------------------------------------+
                          Status of the server
+----------------------------------------------------------------------+

Api Version:           2
Database Connection:   
  Connected: True
Known Workers:         
  _id:            reserved_resource_worker-0@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-4@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-3@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-6@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-2@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-5@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-17@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-7@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-14@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-8@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-11@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-18@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-12@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-16@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-15@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-20@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-21@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-19@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-13@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-22@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-23@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            resource_manager@ibm-x3550m3-09.lab.eng.brq.redhat.com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:42Z
Messaging Connection:  
  Connected: True
Versions:              
  Platform Version: 2.13b1

[root@ibm-x3550m3-09 ~]#  pulp-admin status
+----------------------------------------------------------------------+
                          Status of the server
+----------------------------------------------------------------------+

Api Version:           2
Database Connection:   
  Connected: True
Known Workers:         
Messaging Connection:  
  Connected: True
Versions:              
  Platform Version: 2.13b1

[root@ibm-x3550m3-09 ~]# 
Actions #26

Updated by pcreech almost 7 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE
Actions #27

Updated by bmbouter about 6 years ago

  • Sprint set to Sprint 16
Actions #28

Updated by bmbouter about 6 years ago

  • Sprint changed from Sprint 16 to Sprint 14
Actions #29

Updated by bmbouter about 6 years ago

  • Sprint/Milestone deleted (32)
Actions #30

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added
