Issue #2496

After killing pulp_workers, pulp_celerybeat, and pulp_resource_manager, the status API still shows them as running

Added by bmbouter about 5 years ago. Updated almost 3 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: Normal
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release: 2.13.0
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint: Sprint 14
Quarter:

Description

To reproduce:

1. Start with a running, healthy Pulp system
2. Kill pulp_celerybeat, pulp_resource_manager, and pulp_workers, in that order

[vagrant@dev ~]$ sudo pkill -9 -f 'celery beat'
[vagrant@dev ~]$ sudo pkill -9 -f 'celery worker'

3. Verify they are stopped

[vagrant@dev ~]$ ps -awfux | grep celery
vagrant   1154  0.0  0.0  12736   996 pts/0    S+   22:18   0:00              \_ grep --color=auto celery

4. Look at the status API and see that processes are reported as still running

[vagrant@dev ~]$ pulp-admin status
+----------------------------------------------------------------------+
                          Status of the server
+----------------------------------------------------------------------+

Api Version:           2
Database Connection:   
  Connected: True
Known Workers:         
  _id:            scheduler@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:22:28Z
  _id:            resource_manager@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
  _id:            reserved_resource_worker-1@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
  _id:            reserved_resource_worker-2@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
  _id:            reserved_resource_worker-0@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
  _id:            reserved_resource_worker-3@dev
  _ns:            workers
  Last Heartbeat: 2016-12-16T22:23:03Z
Messaging Connection:  
  Connected: True
Versions:              
  Platform Version: 2.10.3b2

Related issues

Related to Pulp - Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running (CLOSED - CURRENTRELEASE)
Blocked by Pulp - Story #2519: Enable workers to record their own heartbeat records to the database (CLOSED - CURRENTRELEASE)

Associated revisions

Revision d92a1dea View on GitHub
Added by dkliban@redhat.com almost 5 years ago

Problem: Stale worker documents present in the db

Solution: A custom queryset for the Worker model allows filtering out worker records which have not been updated in more than 25 seconds. This queryset is used in two places:

  • Status API

  • Resource manager code that looks for available workers

closes #2496 https://pulp.plan.io/issues/2496
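
The filtering rule in this commit message can be sketched in plain Python. This is only an illustration, not the Pulp model code: `WorkerRecord`, `live_workers`, and `HEARTBEAT_TIMEOUT` are hypothetical names, and the only detail taken from the source is the 25-second cut-off.

```python
from collections import namedtuple
from datetime import datetime, timedelta, timezone

# Cut-off taken from the commit message; the constant name is hypothetical.
HEARTBEAT_TIMEOUT = timedelta(seconds=25)

# Stand-in for a worker document stored in the database (hypothetical shape).
WorkerRecord = namedtuple("WorkerRecord", ["name", "last_heartbeat"])


def live_workers(records, now=None):
    """Return only the records updated within the last 25 seconds."""
    if now is None:
        now = datetime.now(timezone.utc)
    return [r for r in records if now - r.last_heartbeat <= HEARTBEAT_TIMEOUT]
```

Under this sketch, both the status view and the resource manager's worker lookup would call `live_workers` instead of reading every record, so a record left behind by a killed process stops being reported once its heartbeat is more than 25 seconds old.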

History

#1 Updated by bmbouter about 5 years ago

I think the main deficiency here is that nearly every place that reads the Worker records considers a worker alive as long as its record is present. The timestamps are ignored almost everywhere. Currently the only thing that actually checks timestamps is the worker watcher thread in pulp_celerybeat, so if you kill that first, the records never get cleaned up.

Recently, we've planned to teach pulp-manage-db to look at the timestamps to evaluate for itself whether there are still workers running. This is a neat trick because pulp-manage-db is then not dependent on the records being maintained by pulp_celerybeat. This same approach could be applied in the limited places that read workers so that even if the records are still present, once the timestamp ages out the record will be ignored. Specifically we could apply this to:

  • The resource manager task which looks for workers
  • The /status/ api view

I think ^ are the only places.

This leaves only one lingering problem, which is that pulp_celerybeat also writes the records. I think it would be better if all workers wrote their own records to the db instead of sending them through the message bus. Without this, pulp-manage-db's wait for workers could mistakenly end if pulp_celerybeat was killed even though workers are still running. This is because their timestamps would not be updated even though they are still heartbeating. I could even file this as a separate bug if someone encourages me to. It is technically a separate problem.
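
To illustrate the concern, here is a small simulation sketch. It assumes a simplified model in which a heartbeat reaches the database either because the worker writes it directly or because a relay process (standing in for pulp_celerybeat) is alive to write it. `HeartbeatStore` and `run_worker` are hypothetical names, not Pulp APIs.

```python
from datetime import datetime, timedelta, timezone


class HeartbeatStore:
    """In-memory stand-in for the workers collection (hypothetical)."""

    def __init__(self):
        self.last_heartbeat = {}

    def record(self, worker_name, when):
        """Insert or update the heartbeat timestamp for one worker."""
        self.last_heartbeat[worker_name] = when


def run_worker(store, worker_name, start, ticks, interval,
               writes_own_record, relay_alive):
    """Advance a simulated clock; return the time after the last tick.

    On each tick the worker heartbeats. The timestamp is stored only if
    the worker writes its own record or the relay process is still alive.
    """
    now = start
    for _ in range(ticks):
        now += interval
        if writes_own_record or relay_alive:
            store.record(worker_name, now)
    return now
```

With the relay dead and `writes_own_record=False`, the stored timestamp never advances, so a reader that trusts timestamps would wrongly conclude that a healthy worker is gone. With `writes_own_record=True`, the timestamp stays current regardless of the relay.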

#2 Updated by bmbouter about 5 years ago

  • Related to Issue #2491: When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running added

#4 Updated by mhrivnak about 5 years ago

Since we want to do the same query in multiple places, it's worth considering a custom `QuerySetManager` on the Worker model that applies the right filters. That would let us define "the rules" for identifying live workers in one place, near the model itself.

#5 Updated by bmbouter about 5 years ago

mhrivnak: that is a great idea; let's do that.

#6 Updated by bizhang about 5 years ago

  • Sprint/Milestone set to 31

#7 Updated by bizhang about 5 years ago

  • Triaged changed from No to Yes

#8 Updated by jortel@redhat.com about 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to jortel@redhat.com

#9 Updated by jortel@redhat.com about 5 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (jortel@redhat.com)

#11 Updated by dkliban@redhat.com about 5 years ago

  • Related to Story #2519: Enable workers to record their own heartbeat records to the database added

#12 Updated by mhrivnak about 5 years ago

  • Sprint/Milestone changed from 31 to 32

#13 Updated by daviddavis about 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to daviddavis

#14 Updated by daviddavis about 5 years ago

What sort of time limit do we want to use here (i.e. after how many seconds can we assume the process died)?

#15 Updated by bmbouter about 5 years ago

As of 2.12, any worker whose check-in timestamp is older than 25 seconds should be considered missing.

Additionally, I think this is blocked on 2519. I should have put that on earlier.

#16 Updated by bmbouter about 5 years ago

  • Related to deleted (Story #2519: Enable workers to record their own heartbeat records to the database)

#17 Updated by bmbouter about 5 years ago

  • Blocked by Story #2519: Enable workers to record their own heartbeat records to the database added

#18 Updated by daviddavis about 5 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (daviddavis)

Unassigning myself since this is not ready.

#19 Updated by dkliban@redhat.com almost 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to dkliban@redhat.com

#20 Updated by dkliban@redhat.com almost 5 years ago

  • Status changed from ASSIGNED to POST

#22 Updated by dkliban@redhat.com almost 5 years ago

  • Status changed from POST to MODIFIED

#23 Updated by semyers almost 5 years ago

  • Platform Release set to 2.13.0

#24 Updated by pcreech almost 5 years ago

  • Status changed from MODIFIED to 5

#25 Updated by pthomas@redhat.com almost 5 years ago

verified

[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery beat'
Killed
[root@ibm-x3550m3-09 ~]# sudo pkill -9 -f 'celery worker'
Killed
[root@ibm-x3550m3-09 ~]#  ps -awfux | grep celery
root      6039  0.0  0.0 112648   960 pts/0    S+   14:16   0:00          \_ grep --color=auto celery
[root@ibm-x3550m3-09 ~]#  pulp-admin status
+----------------------------------------------------------------------+
                          Status of the server
+----------------------------------------------------------------------+

Api Version:           2
Database Connection:   
  Connected: True
Known Workers:         
  _id:            reserved_resource_worker-0@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-4@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-3@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-6@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-2@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-5@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-17@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-7@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-14@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-8@ibm-x3550m3-09.lab.eng.brq.redhat.c
                  om
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-11@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-18@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-12@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-16@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-15@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-20@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-21@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-19@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-13@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:41Z
  _id:            reserved_resource_worker-22@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            reserved_resource_worker-23@ibm-x3550m3-09.lab.eng.brq.redhat.
                  com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:40Z
  _id:            resource_manager@ibm-x3550m3-09.lab.eng.brq.redhat.com
  _ns:            workers
  Last Heartbeat: 2017-04-19T12:16:42Z
Messaging Connection:  
  Connected: True
Versions:              
  Platform Version: 2.13b1

[root@ibm-x3550m3-09 ~]#  pulp-admin status
+----------------------------------------------------------------------+
                          Status of the server
+----------------------------------------------------------------------+

Api Version:           2
Database Connection:   
  Connected: True
Known Workers:         
Messaging Connection:  
  Connected: True
Versions:              
  Platform Version: 2.13b1

[root@ibm-x3550m3-09 ~]# 
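
As a quick check on the two outputs above, the following sketch computes when the newest heartbeat in the first listing would age out. It assumes the 25-second cut-off mentioned earlier in this issue; the time at which the second `pulp-admin status` was run is not recorded here.

```python
from datetime import datetime, timedelta

FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Newest heartbeat in the first status output (resource_manager).
latest = datetime.strptime("2017-04-19T12:16:42Z", FORMAT)

# Assumed cut-off of 25 seconds, per the discussion in this issue.
expires = latest + timedelta(seconds=25)

print(expires.strftime(FORMAT))  # 2017-04-19T12:17:07Z
```

An empty "Known Workers" list is therefore what one would expect from any status call made after 12:17:07Z, provided no process was restarted in between.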

#26 Updated by pcreech over 4 years ago

  • Status changed from 5 to CLOSED - CURRENTRELEASE

#27 Updated by bmbouter almost 4 years ago

  • Sprint set to Sprint 16

#28 Updated by bmbouter almost 4 years ago

  • Sprint changed from Sprint 16 to Sprint 14

#29 Updated by bmbouter almost 4 years ago

  • Sprint/Milestone deleted (32)

#30 Updated by bmbouter almost 3 years ago

  • Tags Pulp 2 added
