Issue #2496 (CLOSED)
Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API to still show them as running
Description
To reproduce:
1. start with a running healthy pulp system
2. kill pulp_celerybeat, pulp_resource_manager, and pulp_workers in that order
[vagrant@dev ~]$ sudo pkill -9 -f 'celery beat'
[vagrant@dev ~]$ sudo pkill -9 -f 'celery worker'
3. Verify they are stopped
[vagrant@dev ~]$ ps -awfux | grep celery
vagrant 1154 0.0 0.0 12736 996 pts/0 S+ 22:18 0:00 \_ grep --color=auto celery
4. Look at the status API and see that processes are reported as still running
[vagrant@dev ~]$ pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: scheduler@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:22:28Z
_id: resource_manager@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-1@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-2@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-0@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
_id: reserved_resource_worker-3@dev
_ns: workers
Last Heartbeat: 2016-12-16T22:23:03Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.10.3b2
Related issues
Related to Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running
Blocked by Story #2519: Enable workers to record their own heartbeat records to the database
Updated by bmbouter over 6 years ago
I think the main deficiency here is that, in general, everything that reads the Worker records considers a worker alive merely because its record is present. The timestamps are ignored almost everywhere. Currently the only thing that actually checks timestamps is the worker watcher thread in pulp_celerybeat, so if you kill that first the records never get cleaned up.
Recently, we've planned to teach pulp-manage-db to look at the timestamps and evaluate for itself whether there are still workers running. This is a neat trick because pulp-manage-db is then not dependent on the records being maintained by pulp_celerybeat. The same approach could be applied in the limited places that read workers, so that even if a record is still present, once its timestamp ages out it will be ignored (see the sketch after this list). Specifically we could apply this to:
- The resource manager task which looks for workers
- The /status/ api view
I think ^ are the only places.
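A rough sketch of the idea, assuming the worker documents live in a `workers` collection with a `last_heartbeat` field like the one shown in the status output above (the helper name, the pymongo collection handle, and the cutoff value are illustrative, not the actual Pulp API):

```python
from datetime import datetime, timedelta

# Illustrative cutoff; the thread later settles on 25 seconds.
HEARTBEAT_CUTOFF = timedelta(seconds=25)


def live_workers(worker_collection):
    """Return only worker documents with a recent heartbeat, regardless of
    whether pulp_celerybeat ever got a chance to delete the stale records."""
    oldest_allowed = datetime.utcnow() - HEARTBEAT_CUTOFF
    return worker_collection.find({'last_heartbeat': {'$gte': oldest_allowed}})
```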
This leaves one lingering problem: pulp_celerybeat is also the thing that writes the records. I think it would be better if all workers wrote their own records to the db instead of sending them through the message bus. Without this, pulp-manage-db could mistakenly stop waiting and proceed if pulp_celerybeat was killed, even though workers are still running, because their timestamps would no longer be updated even though they are still heartbeating. I could even file this as a separate bug if someone encourages me to; it technically is a separate problem. A minimal sketch of workers writing their own heartbeats is below.
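A minimal sketch of that idea, with each worker process refreshing its own record directly in the db (the function names, interval, and pymongo handle are assumptions for illustration, not the actual Pulp implementation):

```python
import threading
import time
from datetime import datetime

HEARTBEAT_INTERVAL = 5  # seconds; illustrative value


def start_heartbeat_thread(worker_collection, worker_name):
    """Upsert this worker's own record on a timer, instead of sending
    heartbeats through the message bus for pulp_celerybeat to record."""
    def _beat():
        while True:
            worker_collection.update_one(
                {'_id': worker_name},
                {'$set': {'last_heartbeat': datetime.utcnow()}},
                upsert=True)
            time.sleep(HEARTBEAT_INTERVAL)

    thread = threading.Thread(target=_beat)
    thread.daemon = True
    thread.start()
```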
Updated by bmbouter over 6 years ago
- Related to Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running added
Updated by mhrivnak over 6 years ago
Since we want to do the same query in multiple places, it's worth considering a custom ```QuerySetManager``` on the Worker model that applies the right filters. That would let us define "the rules" for identifying live workers in one place, near the model itself.
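Something along these lines, as a sketch (MongoEngine-style; the class names, field names, and cutoff constant are assumptions for illustration, not the actual Pulp model):

```python
from datetime import datetime, timedelta

from mongoengine import DateTimeField, Document, QuerySet, StringField

WORKER_TIMEOUT_SECONDS = 25  # cutoff discussed later in this issue


class WorkerQuerySet(QuerySet):
    """Keeps 'the rules' for identifying live workers in one place."""

    def get_online(self):
        cutoff = datetime.utcnow() - timedelta(seconds=WORKER_TIMEOUT_SECONDS)
        return self.filter(last_heartbeat__gte=cutoff)


class Worker(Document):
    name = StringField(primary_key=True)
    last_heartbeat = DateTimeField()

    meta = {'collection': 'workers', 'queryset_class': WorkerQuerySet}
```

Every caller would then use `Worker.objects.get_online()` instead of re-implementing the timestamp check.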
Updated by bmbouter over 6 years ago
mhrivnak: that is a great idea; let's do that.
Updated by jortel@redhat.com about 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to jortel@redhat.com
Updated by jortel@redhat.com about 6 years ago
- Status changed from ASSIGNED to NEW
- Assignee deleted (jortel@redhat.com)
Updated by dkliban@redhat.com about 6 years ago
I added a new issue, https://pulp.plan.io/issues/2519, to address the last part of https://pulp.plan.io/issues/2496#note-1.
Updated by dkliban@redhat.com about 6 years ago
- Related to Story #2519: Enable workers to record their own heartbeat records to the database added
Updated by daviddavis about 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
Updated by daviddavis about 6 years ago
What sort of time limit do we want to use here (i.e., after how many seconds can we assume the process has died)?
Updated by bmbouter about 6 years ago
As of 2.12, any worker whose check-in timestamp is older than 25 seconds should be considered missing.
Additionally, I think this is blocked on 2519. I should have added that blocker earlier.
Updated by bmbouter about 6 years ago
- Related to deleted (Story #2519: Enable workers to record their own heartbeat records to the database)
Updated by bmbouter about 6 years ago
- Blocked by Story #2519: Enable workers to record their own heartbeat records to the database added
Updated by daviddavis about 6 years ago
- Status changed from ASSIGNED to NEW
- Assignee deleted (daviddavis)
Unassigning myself since this is not ready.
Updated by dkliban@redhat.com about 6 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to dkliban@redhat.com
Updated by dkliban@redhat.com about 6 years ago
- Status changed from ASSIGNED to POST
Added by dkliban@redhat.com about 6 years ago
Problem: Stale worker documents present in the db
Solution: A custom queryset for the Worker model allows filtering out worker records which have not been updated in more than 25 seconds. This queryset is used in two places:
- Status API
- Resource manager code that looks for available workers
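Roughly how the two call sites might consume such a queryset (a sketch assuming the `get_online()` manager from the earlier comment; not the literal Pulp code):

```python
# Assumes the Worker model / WorkerQuerySet sketched in the earlier comment.

def status_view_workers():
    # Status API: report only workers whose heartbeat is fresh.
    return [{'_id': w.name, 'last_heartbeat': w.last_heartbeat}
            for w in Worker.objects.get_online()]


def pick_worker_for_task():
    # Resource manager: dispatch reserved work only to live workers.
    return Worker.objects.get_online().first()
```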
Updated by dkliban@redhat.com about 6 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|d92a1deada56788bf862b26e61dc9b0a9027e2ff.
Updated by pthomas@redhat.com almost 6 years ago
verified
[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery beat'
Killed
[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery worker'
Killed
[root@ibm-x3550m3-09 ~]# ps -awfux | grep celery
root 6039 0.0 0.0 112648 960 pts/0 S+ 14:16 0:00 \_ grep --color=auto celery
[root@ibm-x3550m3-09 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: reserved_resource_worker-0@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-4@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-3@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-6@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-2@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-5@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-17@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-7@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-14@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-8@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-11@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-18@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-12@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-16@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-15@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-20@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-21@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-19@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-13@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:41Z
_id: reserved_resource_worker-22@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: reserved_resource_worker-23@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:40Z
_id: resource_manager@ibm-x3550m3-09.lab.eng.brq.redhat.com
_ns: workers
Last Heartbeat: 2017-04-19T12:16:42Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.13b1
[root@ibm-x3550m3-09 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.13b1
[root@ibm-x3550m3-09 ~]#
Updated by pcreech almost 6 years ago
- Status changed from 5 to CLOSED - CURRENTRELEASE
Updated by bmbouter about 5 years ago
- Sprint changed from Sprint 16 to Sprint 14