Issue #2491
Status: CLOSED - CURRENTRELEASE
When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running
Description
To reproduce:
1. start with a running healthy pulp system
2. stop pulp_celerybeat, pulp_workers, and pulp_resource_manager (IN THAT ORDER)
[vagrant@dev ~]$ sudo systemctl stop pulp_celerybeat
[vagrant@dev ~]$ sudo systemctl stop pulp_workers
[vagrant@dev ~]$ sudo systemctl stop pulp_resource_manager
3. Verify they are stopped
[vagrant@dev ~]$ ps -awfux | grep celery
vagrant 1154 0.0 0.0 12736 996 pts/0 S+ 22:18 0:00 \_ grep --color=auto celery
4. Look at the status API and see that processes are reported as still running
[vagrant@dev ~]$ pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: scheduler@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:51Z
_id: resource_manager@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:52Z
_id: reserved_resource_worker-0@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:52Z
_id: reserved_resource_worker-3@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:52Z
_id: reserved_resource_worker-2@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:52Z
_id: reserved_resource_worker-1@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:53Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.10.3b2
Related issues
Updated by mhrivnak over 7 years ago
As an option, maybe this feature of celery could be useful to us.
Right now, we have a data quality problem where entries in the Workers collection cannot be trusted to represent running processes. It will take some work, and at least one release/upgrade cycle, to get that straightened out. It also could be a complex change, since we have a delicate concurrency challenge of who owns that data and is allowed to change it. If we allow each worker process to modify or delete its own entry during shutdown, we have to be very careful that doesn't violate an assumption somewhere else.
In the meantime, we could shift our thinking and consider that collection to be a list of workers that are known to have been active recently, but nothing more.
The status API could send a broadcast ping and ensure it gets responses from all entries in the database before returning them. That should resolve the problem of the status API misrepresenting worker state.
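A minimal sketch of that cross-check, assuming celery's `Control.ping` (demonstrated later in this thread) and worker records shaped like the `_id` entries in the status output above; `workers_with_ping_status` is an illustrative name, not an actual Pulp API:

```python
def workers_with_ping_status(control, worker_records, timeout=1.0):
    """Return (record, responded) pairs for each known worker record.

    control        -- a celery.app.control.Control bound to the Pulp app
    worker_records -- iterable of dicts with an '_id' key (the worker name)
    """
    # Each ping reply is a one-entry dict like {'worker-0@dev': {'ok': 'pong'}};
    # collect the names of every worker that answered within the timeout.
    replies = control.ping(timeout=timeout) or []
    responders = {name for reply in replies for name in reply}
    # A record is "live" only if its _id matches a responder.
    return [(record, record['_id'] in responders) for record in worker_records]
```

A status handler could then drop (or flag) any record whose worker did not answer, at the cost of waiting up to `timeout` seconds per status call, which matters given the performance concern raised below.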
Updated by bizhang over 7 years ago
- Priority changed from Normal to High
- Sprint/Milestone set to 31
- Severity changed from 2. Medium to 3. High
- Triaged changed from No to Yes
Updated by daviddavis over 7 years ago
One thing to note: Katello calls the status API before it performs any action in Pulp (see https://git.io/v1Q14) so performance of the status check should be a factor in whatever we decide to do.
Updated by bmbouter over 7 years ago
The ping idea would be easy enough, but I think it would be better to do a few other things instead.
pulp_celerybeat doesn't respond to ping, which I confirmed with this little test:
[vagrant@dev ~]$ sudo -u apache python
Python 2.7.12 (default, Sep 29 2016, 13:30:34)
[GCC 6.2.1 20160916 (Red Hat 6.2.1-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pulp.server.async.celery_instance import celery
>>> from celery.app.control import Control
>>> c = Control(app=celery)
>>> c.ping()
[{u'reserved_resource_worker-1@dev': {u'ok': u'pong'}}, {u'reserved_resource_worker-0@dev': {u'ok': u'pong'}}, {u'reserved_resource_worker-2@dev': {u'ok': u'pong'}}, {u'reserved_resource_worker-3@dev': {u'ok': u'pong'}}, {u'resource_manager@dev': {u'ok': u'pong'}}]
Also I think /status/ should show the state of Pulp as Pulp believes it to be. Many areas of Pulp work from these records, so I think keeping /status/ as a way to inspect the current state of these records is good. Also, I can imagine cases where Pulp would lose track of a worker due to a bug and yet ping() would still work.
One other aspect is that we should think about the state of these records in two cases: graceful versus non-graceful shutdown of processes. I propose this ticket be dedicated to graceful shutdown only. I have a few specific ideas about hardening the non-graceful case, but I propose I write those into a separate ticket and also relate it to the Katello bugzilla.
So for graceful shutdown, I think we just need to do a few things:
- add the close method here[0] which fixed it for pulp_celerybeat
- investigate why this codepath[1] isn't handling this for the resource_manager itself
- probably rework worker record deletion to not use [1] and instead use a celery shutdown signal[2]
I don't think this bug is concerned with the upgrade case. We can deliver a fix that will allow graceful shutdown to work. After the fix is applied to the running processes, then if they are gracefully shut down, only then would I expect the /status/ call to be accurate, so in that way I think it's a normal bugfix.
[0]: https://github.com/pulp/pulp/pull/2781/files#diff-80e8b96df1f5da9a551bb6ff18dea352R373
[1]: https://github.com/pulp/pulp/blob/33ce8590f6112c2101930e23f0470c1883ef98a7/server/pulp/server/async/app.py#L106
[2]: https://github.com/pulp/pulp/blob/33ce8590f6112c2101930e23f0470c1883ef98a7/server/pulp/server/async/app.py#L36
Updated by mhrivnak over 7 years ago
bmbouter wrote:
> The ping idea would be easy enough, but I think it would be better to do a few other things instead.
> pulp_celerybeat doesn't respond to ping, which I confirmed with this little test:
bummer
> Also I think /status/ should show the state of Pulp as Pulp believes it to be. Many areas of Pulp work from these records, so I think keeping /status/ as a way to inspect the current state of these records is good. Also, I can imagine cases where Pulp would lose track of a worker due to a bug and yet ping() would still work.
I'm not trying to push the ping idea, but I do want to point out that all of these things can be achieved together. The /status/ API call could show each worker record, when its last heartbeat was, and whether or not it responded to a ping.
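For illustration, combining all three pieces of information could be as simple as annotating each record; the field names below mirror the pulp-admin output, the ping-reply shape follows the `c.ping()` output quoted earlier, and `annotate_with_ping` is a hypothetical helper, not an existing Pulp function:

```python
def annotate_with_ping(worker_records, ping_replies):
    """Merge stored worker records with live ping results.

    worker_records -- dicts with '_id' and optionally 'last_heartbeat'
    ping_replies   -- list of one-entry dicts as returned by Control.ping()
    """
    responders = {name for reply in (ping_replies or []) for name in reply}
    return [
        {
            '_id': rec['_id'],
            'last_heartbeat': rec.get('last_heartbeat'),
            'responded_to_ping': rec['_id'] in responders,
        }
        for rec in worker_records
    ]
```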
> One other aspect is that we should think about the state of these records in two cases: graceful versus non-graceful shutdown of processes. I propose this ticket be dedicated to graceful shutdown only. I have a few specific ideas about hardening the non-graceful case, but I propose I write those into a separate ticket and also relate it to the Katello bugzilla.
> So for graceful shutdown, I think we just need to do a few things:
> - add the close method here[0] which fixed it for pulp_celerybeat
> - investigate why this codepath[1] isn't handling this for the resource_manager itself
> - probably rework worker record deletion to not use [1] and instead use a celery shutdown signal[2]
I agree that if graceful shutdown of a process can lead to that process removing its own record, that is ideal. We just need to be confident that's a safe change to make. If you're happy with it, I'm happy!
> I don't think this bug is concerned with the upgrade case. We can deliver a fix that will allow graceful shutdown to work. After the fix is applied to the running processes, then if they are gracefully shut down, only then would I expect the /status/ call to be accurate, so in that way I think it's a normal bugfix.
I'm not sure what you're getting at with this paragraph.
[0]: https://github.com/pulp/pulp/pull/2781/files#diff-80e8b96df1f5da9a551bb6ff18dea352R373
[1]: https://github.com/pulp/pulp/blob/33ce8590f6112c2101930e23f0470c1883ef98a7/server/pulp/server/async/app.py#L106
[2]: https://github.com/pulp/pulp/blob/33ce8590f6112c2101930e23f0470c1883ef98a7/server/pulp/server/async/app.py#L36
Updated by bmbouter over 7 years ago
I think we can get graceful shutdown working for all of those processes. I can help support @daviddavis, who mentioned wanting to take this on as the assignee.
I agree that the upgrade-case paragraph doesn't make much sense. With all the recent discussion about leftover worker records being an issue for upgrades with pulp-manage-db, I wanted to identify the separation of concerns. Re-reading what I wrote, that didn't come out clearly.
So are we good to proceed with the signal based implementation? I think it's something we can fix and it won't disturb any assumptions in other areas of the code.
Updated by daviddavis over 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
Updated by bmbouter over 7 years ago
- Related to Issue #2496: Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API to still show them as running added
Updated by daviddavis over 7 years ago
- Status changed from ASSIGNED to POST
Added by daviddavis over 7 years ago
Revision 95246c57 | View on GitHub
Fix status API to not show stopped workers and scheduler
This fix ensures that workers (including resource_manager) and the scheduler properly clean up their database records when gracefully shut down.
Updated by daviddavis over 7 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|95246c572c0d5ef04c08daafa2554179ebbb6bfb.
Updated by pthomas@redhat.com over 7 years ago
verified
[root@qe-blade-05 ~]# rpm -qa |grep pulp
python-kombu-3.0.33-6.pulp.el7.noarch
python-pulp-puppet-common-2.11.1-0.1.beta.el7.noarch
python-pulp-repoauth-2.11.1-0.1.beta.el7.noarch
pulp-puppet-plugins-2.11.1-0.1.beta.el7.noarch
python-pulp-client-lib-2.11.1-0.1.beta.el7.noarch
pulp-docker-admin-extensions-2.2.0-1.el7.noarch
pulp-python-admin-extensions-1.1.3-1.el7.noarch
python-pulp-streamer-2.11.1-0.1.beta.el7.noarch
python-pulp-common-2.11.1-0.1.beta.el7.noarch
python-pulp-rpm-common-2.11.1-0.1.beta.el7.noarch
pulp-rpm-plugins-2.11.1-0.1.beta.el7.noarch
pulp-rpm-admin-extensions-2.11.1-0.1.beta.el7.noarch
python-pulp-python-common-1.1.3-1.el7.noarch
python-isodate-0.5.0-4.pulp.el7.noarch
python-pulp-docker-common-2.2.0-1.el7.noarch
pulp-selinux-2.11.1-0.1.beta.el7.noarch
pulp-docker-plugins-2.2.0-1.el7.noarch
pulp-admin-client-2.11.1-0.1.beta.el7.noarch
python-pulp-ostree-common-1.2.0-1.el7.noarch
pulp-ostree-admin-extensions-1.2.0-1.el7.noarch
python-pulp-oid_validation-2.11.1-0.1.beta.el7.noarch
pulp-server-2.11.1-0.1.beta.el7.noarch
python-pulp-bindings-2.11.1-0.1.beta.el7.noarch
pulp-puppet-admin-extensions-2.11.1-0.1.beta.el7.noarch
pulp-ostree-plugins-1.2.0-1.el7.noarch
pulp-python-plugins-1.1.3-1.el7.noarch
[root@qe-blade-05 ~]#
[root@qe-blade-05 ~]# pulp-admin login -u admin -p admin
Successfully logged in. Session certificate will expire at Jan 17 17:05:50 2017
GMT.
[root@qe-blade-05 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: reserved_resource_worker-1@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-0@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-4@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-6@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-2@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-5@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-3@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-7@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: scheduler@qe-blade-05.idmqe.lab.eng.bos.redhat.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:50Z
_id: resource_manager@qe-blade-05.idmqe.lab.eng.bos.redhat.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:24Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.11.1b1
[root@qe-blade-05 ~]#
[root@qe-blade-05 ~]# sudo systemctl stop pulp_celerybeat
[root@qe-blade-05 ~]# sudo systemctl stop pulp_workers
[root@qe-blade-05 ~]# sudo systemctl stop pulp_resource_manager
[root@qe-blade-05 ~]# ps -awfux | grep celery
root 2218 0.0 0.0 112648 948 pts/0 S+ 12:06 0:00 \_ grep --color=auto celery
[root@qe-blade-05 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.11.1b1
Updated by semyers over 7 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE