Issue #2491
Status: CLOSED - CURRENTRELEASE
When stopping pulp_workers, pulp_celerybeat, and pulp_resource_manager gracefully, the status API still shows them as running
Description
To reproduce:
1. start with a running healthy pulp system
2. stop pulp_celerybeat, pulp_workers, and pulp_resource_manager (IN THAT ORDER)
[vagrant@dev ~]$ sudo systemctl stop pulp_celerybeat
[vagrant@dev ~]$ sudo systemctl stop pulp_workers
[vagrant@dev ~]$ sudo systemctl stop pulp_resource_manager
3. Verify they are stopped
[vagrant@dev ~]$ ps -awfux | grep celery
vagrant 1154 0.0 0.0 12736 996 pts/0 S+ 22:18 0:00 \_ grep --color=auto celery
4. Look at the status API and see that processes are reported as still running
[vagrant@dev ~]$ pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: scheduler@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:51Z
_id: resource_manager@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:52Z
_id: reserved_resource_worker-0@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:52Z
_id: reserved_resource_worker-3@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:52Z
_id: reserved_resource_worker-2@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:52Z
_id: reserved_resource_worker-1@dev
_ns: workers
Last Heartbeat: 2016-12-14T22:16:53Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.10.3b2
Related issues
Updated by mhrivnak over 7 years ago
As an option, maybe this feature of celery could be useful to us.
Right now, we have a data quality problem where entries in the Workers collection cannot be trusted to represent running processes. It will take some work, and at least one release/upgrade cycle, to get that straightened out. It also could be a complex change, since we have a delicate concurrency challenge of who owns that data and is allowed to change it. If we allow each worker process to modify or delete its own entry during shutdown, we have to be very careful that doesn't violate an assumption somewhere else.
In the meantime, we could shift our thinking and consider that collection to be a list of workers that are known to have been active recently, but nothing more.
The status API could send a broadcast ping and ensure it gets responses from all entries in the database before returning them. That should resolve the problem of the status API misrepresenting worker state.
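A minimal sketch of that cross-check, assuming celery's `Control.ping` (demonstrated later in this thread) and worker records shaped like the `_id` entries in the status output above; `workers_with_ping_status` is an illustrative name, not an actual Pulp API:

```python
def workers_with_ping_status(control, worker_records, timeout=1.0):
    """Return (record, responded) pairs for each known worker record.

    control        -- a celery.app.control.Control bound to the Pulp app
    worker_records -- iterable of dicts with an '_id' key (the worker name)
    """
    # Each ping reply is a one-entry dict like {'worker-0@dev': {'ok': 'pong'}};
    # collect the names of every worker that answered within the timeout.
    replies = control.ping(timeout=timeout) or []
    responders = {name for reply in replies for name in reply}
    # A record is "live" only if its _id matches a responder.
    return [(record, record['_id'] in responders) for record in worker_records]
```

A status handler could then drop (or flag) any record whose worker did not answer, at the cost of waiting up to `timeout` seconds per status call, which matters given the performance concern raised below.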
Updated by bizhang over 7 years ago
- Priority changed from Normal to High
- Sprint/Milestone set to 31
- Severity changed from 2. Medium to 3. High
- Triaged changed from No to Yes
Updated by daviddavis over 7 years ago
One thing to note: Katello calls the status API before it performs any action in Pulp (see https://git.io/v1Q14) so performance of the status check should be a factor in whatever we decide to do.
Updated by bmbouter over 7 years ago
The ping idea would be easy enough, but I think it would be better to do a few other things instead.
pulp_celerybeat doesn't respond to ping, which I confirmed with this little test:
[vagrant@dev ~]$ sudo -u apache python
Python 2.7.12 (default, Sep 29 2016, 13:30:34)
[GCC 6.2.1 20160916 (Red Hat 6.2.1-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pulp.server.async.celery_instance import celery
>>> from celery.app.control import Control
>>> c = Control(app=celery)
>>> c.ping()
[{u'reserved_resource_worker-1@dev': {u'ok': u'pong'}}, {u'reserved_resource_worker-0@dev': {u'ok': u'pong'}}, {u'reserved_resource_worker-2@dev': {u'ok': u'pong'}}, {u'reserved_resource_worker-3@dev': {u'ok': u'pong'}}, {u'resource_manager@dev': {u'ok': u'pong'}}]
Also I think /status/ should show the state of Pulp as Pulp believes it to be. Many areas of Pulp work from these records, so I think keeping /status/ as a way to inspect the current state of these records is good. Also, I can imagine cases where Pulp would lose track of a worker due to a bug and yet ping() would still work.
One other aspect is that we should think about the state of these records in two cases: graceful versus non-graceful shutdown of processes. I propose this ticket be dedicated to graceful shutdown only. I have a few specific ideas about hardening the non-graceful case, but I propose I write those into a separate ticket and also relate it to the Katello bugzilla.
So for graceful shutdown, I think we just need to do a few things:
- add the close method here[0] which fixed it for pulp_celerybeat
- investigate why this codepath[1] isn't handling this for the resource_manager itself
- probably rework worker record deletion to not use [1] and instead use a celery shutdown signal[2]
I don't think this bug is concerned with the upgrade case. We can deliver a fix that will allow graceful shutdown to work. After the fix is applied to the running processes, then if they are gracefully shut down, only then would I expect the /status/ call to be accurate, so in that way I think it's a normal bugfix.
[0]: https://github.com/pulp/pulp/pull/2781/files#diff-80e8b96df1f5da9a551bb6ff18dea352R373
[1]: https://github.com/pulp/pulp/blob/33ce8590f6112c2101930e23f0470c1883ef98a7/server/pulp/server/async/app.py#L106
[2]: https://github.com/pulp/pulp/blob/33ce8590f6112c2101930e23f0470c1883ef98a7/server/pulp/server/async/app.py#L36
Updated by mhrivnak over 7 years ago
bmbouter wrote:
> The ping idea would be easy enough, but I think it would be better to do a few other things instead.
> pulp_celerybeat doesn't respond to ping, which I confirmed with this little test:
bummer
> Also I think /status/ should show the state of Pulp as Pulp believes it to be. Many areas of Pulp work from these records, so I think keeping /status/ as a way to inspect the current state of these records is good. Also, I can imagine cases where Pulp would lose track of a worker due to a bug and yet ping() would still work.
I'm not trying to push the ping idea, but I do want to point out that all of these things can be achieved together. The /status/ API call could show each worker record, when its last heartbeat was, and whether or not it responded to a ping.
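For illustration, combining all three pieces of information could be as simple as annotating each record; the field names below mirror the pulp-admin output, the ping-reply shape follows the `c.ping()` output quoted earlier, and `annotate_with_ping` is a hypothetical helper, not an existing Pulp function:

```python
def annotate_with_ping(worker_records, ping_replies):
    """Merge stored worker records with live ping results.

    worker_records -- dicts with '_id' and optionally 'last_heartbeat'
    ping_replies   -- list of one-entry dicts as returned by Control.ping()
    """
    responders = {name for reply in (ping_replies or []) for name in reply}
    return [
        {
            '_id': rec['_id'],
            'last_heartbeat': rec.get('last_heartbeat'),
            'responded_to_ping': rec['_id'] in responders,
        }
        for rec in worker_records
    ]
```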
> One other aspect is that we should think about the state of these records in two cases: graceful versus non-graceful shutdown of processes. I propose this ticket be dedicated to graceful shutdown only. I have a few specific ideas about hardening the non-graceful case, but I propose I write those into a separate ticket and also relate it to the Katello bugzilla.
> So for graceful shutdown, I think we just need to do a few things:
> - add the close method here[0] which fixed it for pulp_celerybeat
> - investigate why this codepath[1] isn't handling this for the resource_manager itself
> - probably rework worker record deletion to not use [1] and instead use a celery shutdown signal[2]
I agree that if graceful shutdown of a process can lead to that process removing its own record, that is ideal. We just need to be confident that's a safe change to make. If you're happy with it, I'm happy!
> I don't think this bug is concerned with the upgrade case. We can deliver a fix that will allow graceful shutdown to work. After the fix is applied to the running processes, then if they are gracefully shut down, only then would I expect the /status/ call to be accurate, so in that way I think it's a normal bugfix.
I'm not sure what you're getting at with this paragraph.
[0]: https://github.com/pulp/pulp/pull/2781/files#diff-80e8b96df1f5da9a551bb6ff18dea352R373
[1]: https://github.com/pulp/pulp/blob/33ce8590f6112c2101930e23f0470c1883ef98a7/server/pulp/server/async/app.py#L106
[2]: https://github.com/pulp/pulp/blob/33ce8590f6112c2101930e23f0470c1883ef98a7/server/pulp/server/async/app.py#L36
Updated by bmbouter over 7 years ago
I think we can get graceful shutdown working for all of those processes. I can help support @daviddavis, who mentioned wanting to take this on as the assignee.
I agree that the upgrade-case paragraph doesn't make much sense. With all the recent discussion about leftover worker records being an issue for upgrades with pulp-manage-db, I wanted to identify the separation of concerns. Re-reading what I wrote, that didn't come out clearly.
So are we good to proceed with the signal based implementation? I think it's something we can fix and it won't disturb any assumptions in other areas of the code.
Updated by daviddavis over 7 years ago
- Status changed from NEW to ASSIGNED
- Assignee set to daviddavis
Updated by bmbouter over 7 years ago
- Related to Issue #2496: Killing pulp_workers, pulp_celerybeat, and pulp_resource_manager causes the status API to still show them as running added
Updated by daviddavis over 7 years ago
- Status changed from ASSIGNED to POST
Added by daviddavis over 7 years ago
Revision 95246c57 | View on GitHub
Fix status API to not show stopped workers and scheduler
This fix ensures that workers (including resource_manager) and the scheduler properly clean up their database records when gracefully shut down.
Updated by daviddavis over 7 years ago
- Status changed from POST to MODIFIED
Applied in changeset pulp|95246c572c0d5ef04c08daafa2554179ebbb6bfb.
Updated by pthomas@redhat.com over 7 years ago
verified
[root@qe-blade-05 ~]# rpm -qa |grep pulp
python-kombu-3.0.33-6.pulp.el7.noarch
python-pulp-puppet-common-2.11.1-0.1.beta.el7.noarch
python-pulp-repoauth-2.11.1-0.1.beta.el7.noarch
pulp-puppet-plugins-2.11.1-0.1.beta.el7.noarch
python-pulp-client-lib-2.11.1-0.1.beta.el7.noarch
pulp-docker-admin-extensions-2.2.0-1.el7.noarch
pulp-python-admin-extensions-1.1.3-1.el7.noarch
python-pulp-streamer-2.11.1-0.1.beta.el7.noarch
python-pulp-common-2.11.1-0.1.beta.el7.noarch
python-pulp-rpm-common-2.11.1-0.1.beta.el7.noarch
pulp-rpm-plugins-2.11.1-0.1.beta.el7.noarch
pulp-rpm-admin-extensions-2.11.1-0.1.beta.el7.noarch
python-pulp-python-common-1.1.3-1.el7.noarch
python-isodate-0.5.0-4.pulp.el7.noarch
python-pulp-docker-common-2.2.0-1.el7.noarch
pulp-selinux-2.11.1-0.1.beta.el7.noarch
pulp-docker-plugins-2.2.0-1.el7.noarch
pulp-admin-client-2.11.1-0.1.beta.el7.noarch
python-pulp-ostree-common-1.2.0-1.el7.noarch
pulp-ostree-admin-extensions-1.2.0-1.el7.noarch
python-pulp-oid_validation-2.11.1-0.1.beta.el7.noarch
pulp-server-2.11.1-0.1.beta.el7.noarch
python-pulp-bindings-2.11.1-0.1.beta.el7.noarch
pulp-puppet-admin-extensions-2.11.1-0.1.beta.el7.noarch
pulp-ostree-plugins-1.2.0-1.el7.noarch
pulp-python-plugins-1.1.3-1.el7.noarch
[root@qe-blade-05 ~]#
[root@qe-blade-05 ~]# pulp-admin login -u admin -p admin
Successfully logged in. Session certificate will expire at Jan 17 17:05:50 2017
GMT.
[root@qe-blade-05 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
_id: reserved_resource_worker-1@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-0@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-4@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-6@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-2@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-5@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-3@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: reserved_resource_worker-7@qe-blade-05.idmqe.lab.eng.bos.redha
t.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:48Z
_id: scheduler@qe-blade-05.idmqe.lab.eng.bos.redhat.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:50Z
_id: resource_manager@qe-blade-05.idmqe.lab.eng.bos.redhat.com
_ns: workers
Last Heartbeat: 2017-01-10T17:05:24Z
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.11.1b1
[root@qe-blade-05 ~]#
[root@qe-blade-05 ~]# sudo systemctl stop pulp_celerybeat
[root@qe-blade-05 ~]# sudo systemctl stop pulp_workers
[root@qe-blade-05 ~]# sudo systemctl stop pulp_resource_manager
[root@qe-blade-05 ~]# ps -awfux | grep celery
root 2218 0.0 0.0 112648 948 pts/0 S+ 12:06 0:00 \_ grep --color=auto celery
[root@qe-blade-05 ~]# pulp-admin status
+----------------------------------------------------------------------+
Status of the server
+----------------------------------------------------------------------+
Api Version: 2
Database Connection:
Connected: True
Known Workers:
Messaging Connection:
Connected: True
Versions:
Platform Version: 2.11.1b1
Updated by semyers over 7 years ago
- Status changed from MODIFIED to CLOSED - CURRENTRELEASE