Issue #3540

closed

When pulp_workers are restarted, /etc/default/pulp_workers might be ignored

Added by Ichimonji10 almost 7 years ago. Updated over 5 years ago.

Status: CLOSED - WONTFIX
Priority: Normal
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 2. Medium
Version:
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

When I restart the pulp_workers service, I expect that the current settings in the /etc/default/pulp_workers configuration file will be used. However, this is not always the case. In reality, it's quite possible to restart pulp_workers and end up with new processes that use the old settings.

This issue affects one Pulp Smash test in particular: pulp_smash.tests.pulp2.rpm.cli.test_process_recycling.MaxTasksPerChildTestCase. Here's a copy-paste from the test case description:

Test Pulp’s handling of its PULP_MAX_TASKS_PER_CHILD setting.

The PULP_MAX_TASKS_PER_CHILD setting controls how many tasks a worker process executes before being destroyed. Setting this option to a low value, like 2, ensures that processes don’t have a chance to consume large amounts of memory.

Test this feature by doing the following:

1. Use ps to verify that no Pulp worker processes have the --maxtasksperchild option set.
2. Set PULP_MAX_TASKS_PER_CHILD and restart Pulp. Use ps to verify that all Pulp worker processes were invoked with the --maxtasksperchild option.
3. Execute a sync and publish. No errors should be reported.
4. Unset the PULP_MAX_TASKS_PER_CHILD option and restart Pulp. Use ps to verify that no Pulp worker processes have the --maxtasksperchild option set.

For more information, see Pulp #2172.

Frequently, step 4 fails. The Pulp worker processes at the end of the test do have --maxtasksperchild set. (Note that I will likely be adding a kludgy fix into the test case, with a reference to this issue. If you're trying to reproduce this issue, tweak the test code as appropriate.)
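As a rough illustration of the ps check in steps 1, 2, and 4, here's a minimal sketch. It is not the Pulp Smash code. The function name and the sample ps output are made up for illustration, and it assumes that worker command lines contain the reserved_resource_worker name seen in the logs.

```python
def workers_with_max_tasks(ps_output):
    """Return the ps lines of Pulp workers started with --maxtasksperchild."""
    flagged = []
    for line in ps_output.splitlines():
        # Assumption: every Pulp worker command line mentions this name.
        if "reserved_resource_worker" not in line:
            continue
        if "--maxtasksperchild" in line:
            flagged.append(line.strip())
    return flagged


# Hypothetical ps output, for illustration only.
SAMPLE_PS = """\
101 celery worker -n reserved_resource_worker-0@example --maxtasksperchild=2
102 celery worker -n reserved_resource_worker-1@example
103 some unrelated process
"""

if __name__ == "__main__":
    for line in workers_with_max_tasks(SAMPLE_PS):
        print(line)
```

Step 4 passes when this kind of check returns an empty list after the restart. The failure described here is the case where it doesn't.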

Jenkins nodes consistently reproduce this failure. As an example, 8 of the 10 most-recently-completed pulp-2.15-dev-f26 test runs were affected by this issue. (I'd love to give a more precise number, but Jenkins is so awfully slow that combing through results in a more precise manner is painful.) And the majority of all Pulp 2.15 and 2.16 test results are affected by this issue. I've also managed to reproduce this test failure by firing off a job, holding the host that Jenkins creates for the job, canceling the tests being executed by Jenkins, and then playing around with the Pulp installation on that host.

Unfortunately, it's very hard to reproduce this failure outside of Jenkins. On my own VM server, I've spun up a matrix of VMs and run the test against all the hosts in parallel 20 times. The test only failed once. (I don't recall the details of the test matrix, but one axis may have been Pulp 2.14 and 2.15, and the other may have been F25, F26, and RHEL 7.)

Why is this failure so hard to reproduce outside of Jenkins? I'm unsure. My best guess is that the Jenkins hosts are sloooooow. My own VMs can complete the test in 0.5 - 0.75 minutes, whereas Jenkins hosts complete the test in 2 - 3.5 minutes. This slowness is a real issue. Logs from the Jenkins hosts show that the pulp_workers process doesn't always restart cleanly. Here's a snippet from journalctl:

Mar 27 19:20:57 host-172-16-46-33.openstacklocal pulp[6897]: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-0@host-172-16-46-33.openstacklocal' has gone missing, removing from list of workers
Mar 27 19:20:57 host-172-16-46-33.openstacklocal pulp[6897]: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-0@host-172-16-46-33.openstacklocal is missing. Canceling the tasks in its queue.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_workers.service: Stopping timed out. Terminating.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_workers.service: Failed with result 'timeout'.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_worker-0.service: State 'stop-sigterm' timed out. Killing.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_worker-0.service: Main process exited, code=killed, status=9/KILL
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_worker-0.service: Failed with result 'timeout'.

As you can see, some of the services hidden behind pulp_workers fail to shut down cleanly when told to restart, and must be stopped with SIGKILL. (SIGKILL is sent by kill -9. It can't be caught.)
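If you want to spot this pattern in a longer journal, something like the following sketch might help. The regular expressions are written against the systemd lines quoted above and may not match other systemd versions. The function name is made up.

```python
import re

# Patterns written against the journalctl lines quoted in this issue.
TIMEOUT_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+\.service): Failed with result 'timeout'"
)
KILLED_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+\.service): "
    r"Main process exited, code=killed, status=9/KILL"
)


def unclean_stops(journal_text):
    """Return (timed_out_units, sigkilled_units) as sorted lists."""
    timed_out, killed = set(), set()
    for line in journal_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out.add(match.group("unit"))
        match = KILLED_RE.search(line)
        if match:
            killed.add(match.group("unit"))
    return sorted(timed_out), sorted(killed)


# Three of the lines from the snippet above.
SAMPLE_JOURNAL = """\
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_workers.service: Failed with result 'timeout'.
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_worker-0.service: Main process exited, code=killed, status=9/KILL
Mar 27 19:21:58 host-172-16-46-33.openstacklocal systemd[1]: pulp_worker-0.service: Failed with result 'timeout'.
"""

if __name__ == "__main__":
    print(unclean_stops(SAMPLE_JOURNAL))
```

Feeding it the output of journalctl for the relevant time window should list which units timed out and which were killed.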

Also of possible interest: pulp_workers.service has a fishy implementation. It returns the status code of the last pulp_worker-X.service that it touches. That's... overly optimistic.
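To show why "status of the last unit" is overly optimistic, here's a toy sketch. It is not the actual pulp_workers.service code. Both functions are hypothetical and just operate on a list of exit statuses, one per worker unit.

```python
def last_status(statuses):
    """Report only the final unit's status (the behaviour described above)."""
    return statuses[-1] if statuses else 0


def first_failure_status(statuses):
    """Report the first non-zero status, or 0 if every unit succeeded."""
    for status in statuses:
        if status != 0:
            return status
    return 0


if __name__ == "__main__":
    # Hypothetical case: worker 0 fails to stop, workers 1 and 2 are fine.
    statuses = [1, 0, 0]
    print(last_status(statuses))           # the failure is hidden
    print(first_failure_status(statuses))  # the failure is reported
```

With the first approach, any failure other than the last worker's goes unreported to whoever called systemctl.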

I'm convinced that the fault here lies with some implementation detail of the Pulp workers, and that the fault isn't with other aspects of the system, such as NFS caching. Why? Well, consider the unsuccessful fixes I've implemented:

  • After resetting the configuration file, execute one of the following: sync; sync && cat /etc/default/pulp_workers; or cat /etc/default/pulp_workers. If pulp_workers was reading a cached copy of /etc/default/pulp_workers, then this should have fixed that. Furthermore, the cat ... command showed that the file was successfully reset.
  • After resetting the configuration file, execute systemctl stop pulp_workers && systemctl start pulp_workers instead of systemctl restart pulp_workers. If outdated files in the file descriptor store were causing issues, this would have fixed the issue.

Here's the one fix that did work:

  • After resetting the configuration file, sleep for 30 seconds.
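For reference, the workaround might look roughly like this in a test helper. This is a sketch, not the code that went into Pulp Smash. It assumes the sleep sits between resetting the file and restarting the service. The reset_config callable and the function name are hypothetical, and the run and sleep parameters exist only so the helper can be exercised without touching a real system.

```python
import subprocess
import time


def reset_and_restart(reset_config, delay=30,
                      run=subprocess.check_call, sleep=time.sleep):
    """Reset the config, wait, then restart pulp_workers."""
    reset_config()  # hypothetical: restores /etc/default/pulp_workers
    sleep(delay)    # the kludge: give old workers time to go away
    run(["systemctl", "restart", "pulp_workers"])
```

A fixed sleep is a kludge. It papers over the problem on slow hosts and doesn't explain why the old settings stick around.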

What might be going on here? The best theory I have is that, when a worker is killed with SIGKILL, a new one is immediately spawned by a Celery management process. For more on this theory, see celery #102. And again, I'd like to reiterate that pulp_workers.service has a fishy implementation.
