Story #2172 (closed)
Memory Improvements with Process Recycling
Description
Hi,
Pulp needs a lot of memory to copy a repo (see https://pulp.plan.io/issues/1779).
After the copy task finishes, Pulp still holds on to that memory (+6 GB for some workers), even when Pulp is completely idle.
To reduce memory leaks, the Pulp worker should run each task in a fork. Once the task is done, the forked process can exit and return its memory.
Updated by mhrivnak over 8 years ago
The cPython interpreter likes to hold on to memory even when the code it is running (pulp) is no longer using it. Improvements were made in python 3 (.3 I think?), but it's still not perfect.
As you point out, the only way to guarantee that a task returns all its memory is to start a new process for it. We could consider making celery do that for some or all task types.
Of course prevention is also hugely valuable, so we will continue to identify and fix specific areas of code that cause a spike in memory use.
I suggest converting this issue into a Story or Refactor that asks for each pulp task to be run in a one-time-use process. We can continue tracking specific memory use problems as separate bugs.
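A minimal sketch of the one-time-use-process idea, using only the standard library (illustrative, not Pulp code; the task body is hypothetical):

import multiprocessing

def copy_repo(src_repo_id, dest_repo_id):
    """Hypothetical task body that allocates a lot of memory."""
    ...

def run_in_fresh_process(task, *args, **kwargs):
    # Run the task in a child process. When the child exits, all the memory
    # it allocated is returned to the OS, regardless of what the interpreter
    # would otherwise retain.
    child = multiprocessing.Process(target=task, args=args, kwargs=kwargs)
    child.start()
    child.join()

if __name__ == '__main__':
    run_in_fresh_process(copy_repo, 'repo-a', 'repo-b')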
Updated by mhrivnak over 8 years ago
Celery has a setting to replace worker processes after some number of tasks, which could be a simple and effective approach.
http://docs.celeryproject.org/en/latest/userguide/workers.html#max-tasks-per-child-setting
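The same limit can also be passed on the worker command line; a sketch, where the app module path and concurrency flag are plausible examples rather than confirmed Pulp invocations:

celery worker -A pulp.server.async.app -c 1 --maxtasksperchild=2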
Updated by amacdona@redhat.com over 8 years ago
- Tracker changed from Issue to Task
- Subject changed from Reduce memory leaks to Investigate memory usage and optimization
Updated by bmbouter over 8 years ago
I agree we should investigate that worker option. A lot of people use it for this reason. It will cause a lot more connection churn to mongo and qpid as Pulp runs, but that is probably ok as long as the count-before-restart is not too low.
If we do go to enable it, we should set the task count to no less than 2. Almost all Pulp tasks are really two celery tasks: the task itself (task A) and a task run just afterwards to release the reservation for task A.
It's possible that the connection to Qpid could be recycled when this restart occurs. If so, we will lose all tasks (issue #489) in the worker's dedicated queue with each restart. This would be noticed immediately, since release_resource tasks would likely be lost, which would quickly cause tasks to hang. FYI as something to be aware of.
Updated by bmbouter over 8 years ago
I tested this with 1000 tasks and it worked very well. I was investigating a qpid bug[0] and I wanted this regular forking behavior for testing. I can say that Pulp worked very well with this so I think it would be safe to enable.
One thing to be aware of is that the resource_manager also uses this setting, which I didn't expect, but I think that is OK. I spent some time trying to detect at configuration time whether the process is the resource manager, so as not to enable this for it, but I didn't see a way to do that without significant effort.
diff --git a/server/pulp/server/async/celery_instance.py b/server/pulp/server/async/celery_instance.py
index 2a80fd3..693cd46 100644
--- a/server/pulp/server/async/celery_instance.py
+++ b/server/pulp/server/async/celery_instance.py
@@ -46,6 +46,7 @@ celery.conf.update(CELERYBEAT_SCHEDULER='pulp.server.async.scheduler.Scheduler')
 celery.conf.update(CELERY_WORKER_DIRECT=True)
 celery.conf.update(CELERY_TASK_SERIALIZER='json')
 celery.conf.update(CELERY_ACCEPT_CONTENT=['json'])
+celery.conf.update(CELERYD_MAX_TASKS_PER_CHILD=2)
 def configure_login_method():
Updated by mhrivnak over 8 years ago
This could be set on just the regular workers by using the "--maxtasksperchild" command line argument here:
https://github.com/pulp/pulp/blob/0ec80d17/server/pulp/server/async/manage_workers.py#L25
Celery can also load config values from environment variables, but you have to call a specific function with the name of the setting. We could do that, and then only set the environment variable for worker processes.
For long-term testing, I'd be inclined to set the value low just to exercise it and make sure any problems get noticed. For production use, I'd be more inclined to raise the value so the recycling doesn't cause a lot of resource churn. So making it configurable would be ideal.
Not wanting to rock the 2.y boat, I'd make this a fairly low-priority pulp 3 improvement that we could make after we get to a minimum-viable-product with postgres.
Updated by bmbouter over 8 years ago
- Sprint Candidate changed from No to Yes
- Tags Pulp 3 added
mhrivnak wrote:
This could be set on just the regular workers by using the "--maxtasksperchild" command line argument here:
https://github.com/pulp/pulp/blob/0ec80d17/server/pulp/server/async/manage_workers.py#L25
+1 to doing it this way.
Enabling it with an argument sounds good since it would apply only to the pulp_workers. The link you gave is only for systemd; we also need to do it for upstart, which would be here: https://github.com/pulp/pulp/blob/d1735ec63109d19705b36e03c036521cd7126a19/server/etc/rc.d/init.d/pulp_workers
Celery can also load config values from environment variables, but you have to call a specific function with the name of the setting. We could do that, and then only set the environment variable for worker processes.
This probably won't resolve the issue because pulp_workers and pulp_resource_manager won't be able to conditionally switch the arguments they are passing to this special function. Doing it with arguments to celeryd is probably the best way.
For long-term testing, I'd be inclined to set the value low just to exercise it and make sure any problems get noticed. For production use, I'd be more inclined to raise the value so the recycling doesn't cause a lot of resource churn. So making it configurable would be ideal.
+1 to making it configurable with a low default
Not wanting to rock the 2.y boat, I'd make this a fairly low-priority pulp 3 improvement that we could make after we get to a minimum-viable-product with postgres.
+1 to making it on pulp 3 and not in 2.y.
I suggest doing this sooner in the Pulp 3 development rather than later: (1) it won't be hard, and (2) we can all derive the testing benefit as Pulp 3 is developed.
Updated by bmbouter over 8 years ago
I added some checklist items for this story. What else needs to be done to groom this?
Updated by jokroepke over 8 years ago
Hi,
Pulp 3 looks to be far away. Is there a chance to add this line provided by bmbouter to Pulp 2.x, marked as experimental? It does not have to be polished.
Updated by mhrivnak over 8 years ago
bmbouter wrote:
mhrivnak wrote:
This could be set on just the regular workers by using the "--maxtasksperchild" command line argument here:
https://github.com/pulp/pulp/blob/0ec80d17/server/pulp/server/async/manage_workers.py#L25
+1 to doing it this way.
Enabling it with an argument sounds good since it would apply only to the pulp_workers. The link you gave is only for systemd; we also need to do it for upstart, which would be here: https://github.com/pulp/pulp/blob/d1735ec63109d19705b36e03c036521cd7126a19/server/etc/rc.d/init.d/pulp_workers
Not if we wait for pulp 3 to do it. :)
Celery can also load config values from environment variables, but you have to call a specific function with the name of the setting. We could do that, and then only set the environment variable for worker processes.
This probably won't resolve the issue because pulp_workers and pulp_resource_manager won't be able to conditionally switch the arguments they are passing to this special function. Doing it with arguments to celeryd is probably the best way.
They could all load the setting, but if we define it in /etc/default/pulp_workers, then only the worker processes will see the value.
Updated by mhrivnak over 8 years ago
jokroepke wrote:
Hi,
Pulp 3 looks to be far away. Is there a chance to add this line provided by bmbouter to Pulp 2.x, marked as experimental? It does not have to be polished.
The best way to make that happen would be to have a pull request submitted from the community. As long as the default behavior doesn't change, I think we could accept this onto a 2.y.
If it were sufficient for this to work only with systemd, the environment variable route would probably be easiest to implement. We could then consider moving it to Pulp's own configuration (or not) in 3.0. It would be a 1-line addition here to load the setting from an env var:
https://github.com/pulp/pulp/blob/master/server/pulp/server/async/celery_instance.py#L49
and then add one line plus a comment in /etc/default/pulp_workers
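A minimal sketch of that 1-line idea (the variable name follows the thread; reading os.environ directly is an assumption here, rather than Celery's own env-var loading hook):

import os

# `celery` is the module-level app object in celery_instance.py (see the diff above).
if 'PULP_MAX_TASKS_PER_CHILD' in os.environ:
    celery.conf.update(
        CELERYD_MAX_TASKS_PER_CHILD=int(os.environ['PULP_MAX_TASKS_PER_CHILD']))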
Updated by bmbouter about 8 years ago
The environment variable approach seems to do more than we want. Adding a hard-coded value to the systemd definition would be easy, and that file already allows for user configuration here: https://github.com/pulp/pulp/blob/master/server/etc/default/systemd_pulp_workers
If the 2.y behavior were off by default, we could add it there easily, or pull it in from server.conf.
Updated by jokroepke about 8 years ago
Could this be good enough? https://github.com/pulp/pulp/compare/master...jkroepke:issue_2172
Updated by mhrivnak about 8 years ago
bmbouter wrote:
The environment variable approach seems to do more than we want.
Can you elaborate on what you have in mind? I'm just not following.
It seems like both options involve setting an environment variable in /etc/default/pulp_workers. Then it's a question of: do we use that value to add an argument to the command line when starting the worker, or do we let celery read the value during app initialization. Either approach is very simple and likely a 1 or 2 line change.
Updated by bmbouter about 8 years ago
Using environment variables adds a layer of indirection that isn't necessary: it would set the variable, and then code would come along later and change behavior based on it. It would be more direct to have the worker init script and worker systemd unit file specify the argument directly. That also has the additional benefit that you can see whether the behavior is present using `ps -awfux`.
I posted my outline for how to fix the PR here: https://github.com/pulp/pulp/pull/2723#issuecomment-243904063
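As a sketch of the direct-argument approach (manage_workers.py generates the real unit file; the names and paths below are approximations, not the actual template):

# pulp_worker-0.service (approximation)
[Service]
EnvironmentFile=/etc/default/pulp_workers
ExecStart=/usr/bin/celery worker -n reserved_resource_worker-0@%%h \
    -A pulp.server.async.app -c 1 --maxtasksperchild=2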
Added by Jan-Otto Kröpke about 8 years ago
Revision e62f93da | View on GitHub
Refactor Implement PULP_MAX_TASKS_PER_CHILD from https://github.com/pulp/pulp/pull/2723
Updated by bmbouter about 8 years ago
- Status changed from NEW to POST
- Assignee set to jokroepke
- Platform Release set to 2.11.0
- Tags deleted (Pulp 3)
PR available at: https://github.com/pulp/pulp/pull/2723
Updated by bmbouter about 8 years ago
- Related to Story #2371: Use process recycling by default added
Updated by bmbouter about 8 years ago
Another PR which completes this issue: https://github.com/pulp/pulp/pull/2803
Added by bmbouter about 8 years ago
Revision f66bc778 | View on GitHub
Fixes to the process recycling feature
- updates the configs for upstart and systemd
- updates the systemd unit manager
- updates the upstart script
- adds a release note
- adds docs on the feature
- adds a troubleshooting note
- updates unit tests
I hand tested these against systemd and upstart so I expect them to work.
https://pulp.plan.io/issues/2172 closes 2172
Updated by bmbouter about 8 years ago
QE, to test this you should do the following:
test1. verify the feature works on systemd:
1. stop all pulp workers
2. uncomment the PULP_MAX_TASKS_PER_CHILD setting so that its value is 2 (see the example after this list)
3. start pulp and do some basic sync regression testing
4. verify that the pulp workers (not resource_manager, not celerybeat) all have the --maxtasksperchild=2 argument showing on the process listing. I use `ps -awfux | grep celery` to show the processes.
test2. Verify the feature works on upstart
Same as on systemd, except restart the processes as you would on upstart.
test3. verify that, after an upgrade on upstart, /etc/default/pulp_workers contains the section with PULP_MAX_TASKS_PER_CHILD.
test4. verify that, after an upgrade on systemd, /etc/default/pulp_workers contains the section with PULP_MAX_TASKS_PER_CHILD.
test5. verify that a commented-out value for PULP_MAX_TASKS_PER_CHILD causes the upstart pulp_workers processes to not receive the --maxtasksperchild=2 argument. Only pulp_workers are affected; pulp_resource_manager and pulp_celerybeat are not involved in this feature.
test6. same as test5, but on systemd
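For steps 2 and 4 of test1 above, the relevant pieces look roughly like this (the value comes from the test plan; the comment text is illustrative):

# /etc/default/pulp_workers -- uncomment to enable process recycling (step 2)
PULP_MAX_TASKS_PER_CHILD=2

# step 4: only the worker processes should show the flag
sudo ps -awfux | grep celery    # look for --maxtasksperchild=2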
Updated by bmbouter about 8 years ago
- Tracker changed from Task to Story
- Subject changed from Investigate memory usage and optimization to Memory Improvements with Process Recycling
Updated by pthomas@redhat.com almost 8 years ago
- Status changed from 5 to 6
Verified
Followed the tests from https://pulp.plan.io/issues/2172#note-21
Updated by pcreech almost 8 years ago
- Status changed from 6 to CLOSED - CURRENTRELEASE