Issue #3540
closed
When pulp_workers are restarted, /etc/default/pulp_workers might be ignored
Description
When I restart the pulp_workers service, I expect that the current settings in the /etc/default/pulp_workers configuration file will be used. However, this is not always the case. In reality, it's quite possible to restart pulp_workers and have new processes that make use of old settings.
This issue affects one Pulp Smash test in particular: pulp_smash.tests.pulp2.rpm.cli.test_process_recycling.MaxTasksPerChildTestCase. Here's a copy-paste from the test case description:
Test Pulp’s handling of its PULP_MAX_TASKS_PER_CHILD setting.
The PULP_MAX_TASKS_PER_CHILD setting controls how many tasks a worker process executes before being destroyed. Setting this option to a low value, like 2, ensures that processes don’t have a chance to consume large amounts of memory.
Test this feature by doing the following:
1. Use ps to verify that no Pulp worker processes have the --maxtasksperchild option set.
2. Set PULP_MAX_TASKS_PER_CHILD and restart Pulp. Use ps to verify that all Pulp worker processes were invoked with the --maxtasksperchild option.
3. Execute a sync and publish. No errors should be reported.
4. Unset the PULP_MAX_TASKS_PER_CHILD option and restart Pulp. Use ps to verify that no Pulp worker processes have the --maxtasksperchild option set.
For more information, see Pulp #2172.
Frequently, step 4 fails. The Pulp worker processes at the end of the test do have --maxtasksperchild set. (Note that I will likely be adding a kludgy fix into the test case, with a reference to this issue. If you're trying to reproduce this issue, tweak the test code as appropriate.)
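For anyone reproducing this by hand, the ps check from steps 1, 2, and 4 can be sketched roughly as follows. This is not the Pulp Smash implementation. It assumes a Linux procps ps, and it assumes that worker command lines contain the string reserved_resource_worker (the worker name seen in the logs). Both function names are hypothetical.

```python
import subprocess

FLAG = "--maxtasksperchild"
# Assumption: Pulp worker command lines include this worker name prefix.
WORKER_MARKER = "reserved_resource_worker"


def workers_with_flag(ps_output):
    """Return the worker command lines in ps_output that carry FLAG."""
    return [
        line.strip()
        for line in ps_output.splitlines()
        if WORKER_MARKER in line and FLAG in line
    ]


def current_ps_output():
    """Capture the full command line of every process on this host."""
    return subprocess.run(
        ["ps", "-A", "-o", "args="],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
```

Step 4 passes when workers_with_flag(current_ps_output()) returns an empty list after the restart.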
Jenkins nodes consistently reproduce this failure. As an example, 8 of the 10 most-recently-completed pulp-2.15-dev-f26 test runs were affected by this issue. (I'd love to give a more precise number, but Jenkins is so awfully slow that combing through results in a more precise manner is painful.) And the majority of all Pulp 2.15 and 2.16 test results are affected by this issue. I've also managed to reproduce this test failure by firing off a job, holding the host that Jenkins creates for the job, canceling the tests being executed by Jenkins, and then playing around with the Pulp installation on that host.
Unfortunately, it's very hard to reproduce this failure outside of Jenkins. On my own VM server, I've spun up a matrix of VMs and run the test against all the hosts in parallel 20 times. The test only failed once. (I don't recall the details of the test matrix, but one axis may have been Pulp 2.14 and 2.15, and the other may have been F25, F26, and RHEL 7.)
Why is this failure so hard to reproduce outside of Jenkins? I'm unsure. My best guess is that the Jenkins hosts are sloooooow. My own VMs can complete the test in 0.5 - 0.75 minutes, whereas Jenkins hosts complete the test in 2 - 3.5 minutes. This slowness is a real issue. Logs from the Jenkins hosts show that the pulp_workers process doesn't always restart cleanly. Here's a snippet from journalctl:
Mar 27 19:20:57 host-172-16-46-33.openstacklocal pulp[6897]: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-0@host-172-16-46-33.openstacklocal' has gone missing, removing from list of workers
Mar 27 19:20:57 host-172-16-46-33.openstacklocal pulp[6897]: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-0@host-172-16-46-33.openstacklocal is missing. Canceling the tasks in its queue.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_workers.service: Stopping timed out. Terminating.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_workers.service: Failed with result 'timeout'.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_worker-0.service: State 'stop-sigterm' timed out. Killing.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_worker-0.service: Main process exited, code=killed, status=9/KILL
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_worker-0.service: Failed with result 'timeout'.
As you can see, some of the services hidden behind pulp_workers fail to cleanly shut down when told to restart, and must be stopped with SIGKILL. (SIGKILL is the signal sent by kill -9. It can't be caught.)
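To check whether a given host hit the same unclean shutdown, one option is to scan saved journalctl output for units that systemd reports as failed on timeout. The sketch below only matches the message format shown in the snippet above. The function name is hypothetical, and other systemd versions may word the message differently.

```python
import re

# Matches lines such as:
# "... systemd[1]: pulp_worker-0.service: Failed with result 'timeout'."
TIMEOUT_RE = re.compile(
    r"systemd\[\d+\]: (\S+\.service): Failed with result 'timeout'"
)


def units_timed_out(journal_text):
    """Return the units that failed on timeout, in order of first appearance."""
    seen = []
    for match in TIMEOUT_RE.finditer(journal_text):
        unit = match.group(1)
        if unit not in seen:
            seen.append(unit)
    return seen
```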
Also of possible interest: pulp_workers.service has a fishy implementation. It returns the status code of the last pulp_worker-X.service that it touches. That's... overly optimistic.
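To illustrate why reporting only the last status is optimistic, here is a toy comparison. It is not Pulp's actual service code. Both functions are hypothetical and take a plain list of per-worker exit codes.

```python
def last_status(exit_codes):
    """Report only the final exit code, the behavior described above."""
    return exit_codes[-1] if exit_codes else 0


def first_failure(exit_codes):
    """Report the first non-zero exit code, or 0 if every unit succeeded."""
    return next((code for code in exit_codes if code != 0), 0)
```

If the first worker fails to stop and the second succeeds, last_status reports success while first_failure surfaces the error.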
I'm convinced that the fault here lies with some implementation detail of the Pulp workers, and that the fault isn't with other aspects of the system, such as NFS caching. Why? Well, consider the unsuccessful fixes I've implemented:
- After resetting the configuration file, execute sync, sync && cat /etc/default/pulp_workers, or cat /etc/default/pulp_workers. If pulp_workers was reading a cached copy of /etc/default/pulp_workers, then this should have fixed that. Furthermore, the cat ... command showed that the file was successfully reset.
- After resetting the configuration file, execute systemctl stop pulp_workers && systemctl start pulp_workers instead of systemctl restart pulp_workers. If outdated files in the file descriptor store were causing issues, this would have fixed the issue.
Here's the one fix that did work:
- After resetting the configuration file, sleep for 30 seconds.
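A fixed 30-second sleep is what was actually tried. A polling variant, sketched below, is a substitute technique rather than the tested fix: it waits only as long as stale workers are still visible, up to the same 30 seconds. The function name and its parameters are hypothetical. get_flagged_workers stands for any callable that returns the worker command lines still carrying --maxtasksperchild.

```python
import time


def wait_until_clear(get_flagged_workers, timeout=30.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until no flagged workers remain or the timeout elapses.

    Returns True if the workers cleared in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if not get_flagged_workers():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Note that a single empty result may not be conclusive if a killed worker is respawned with old settings shortly afterwards, so a test may still want to re-check after a short pause.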
What might be going on here? The best theory I have is that, when a worker is killed with SIGKILL, a new one is immediately spawned by a Celery management process. For more on this theory, see celery #102. And again, I'd like to reiterate that pulp_workers.service has a fishy implementation.