Story #2172

Memory Improvements with Process Recycling

Added by jokroepke over 3 years ago. Updated 7 months ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
Normal
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
% Done:

0%

Platform Release:
2.11.0
Blocks Release:
Backwards Incompatible:
No
Groomed:
No
Sprint Candidate:
Yes
Tags:
Pulp 2
QA Contact:
Complexity:
Smash Test:
Verified:
Yes
Verification Required:
No
Sprint:

Description

Hi,

Pulp needs a lot of memory to copy a repo (see https://pulp.plan.io/issues/1779).

After the copy task finishes, Pulp still holds on to the memory (+6GB for some workers), even when Pulp is completely idle.

To reduce memory leaks, the Pulp worker should run each task in a fork. When the task is done, the fork can exit, which frees the memory.
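
As a rough illustration of the idea (not Pulp's or Celery's actual implementation), the sketch below runs a task in a forked child and lets the child exit afterwards, so whatever the task allocated is returned to the operating system. `run_in_fork` and `copy_units` are hypothetical names made up for this example, it assumes a POSIX system, and the task result must be JSON-serializable.

```python
import json
import os


def run_in_fork(task, *args):
    """Run task(*args) in a forked child and return (child_pid, result)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: run the task, send the result back, then exit right away so
        # everything the task allocated goes back to the operating system.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as pipe:
                json.dump(task(*args), pipe)
            status = 0
        finally:
            os._exit(status)
    # Parent: read until the child closes its end of the pipe, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        payload = pipe.read()
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0:
        raise RuntimeError("task failed in child process %d" % pid)
    return pid, json.loads(payload)


def copy_units(count):
    # Hypothetical stand-in for a memory-hungry Pulp task.
    units = list(range(count))
    return len(units)
```

Calling `run_in_fork(copy_units, 1000)` returns the child's pid together with the task result; the parent process never holds the task's working memory.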


Checklist


Related issues

Related to Pulp - Story #2371: Use process recycling by default CLOSED - WONTFIX

Associated revisions

Revision 989ba05b
Added by Jan-Otto Kröpke about 3 years ago

Refactor Implement PULP_MAX_TASKS_PER_CHILD
from https://github.com/pulp/pulp/pull/2723#issuecomment-243904063

Revision e62f93da
Added by Jan-Otto Kröpke about 3 years ago

Refactor Implement PULP_MAX_TASKS_PER_CHILD
from https://github.com/pulp/pulp/pull/2723

Revision f66bc778
Added by bmbouter about 3 years ago

Fixes to the process recycling feature

- updates the configs for upstart and systemd
- updates the systemd unit manager
- updates the upstart script
- adds a release note
- adds docs on the feature
- adds a troubleshooting note
- updates unit tests

I hand tested these against systemd and upstart so I
expect them to work.

https://pulp.plan.io/issues/2172
closes 2172

History

#1 Updated by mhrivnak over 3 years ago

The CPython interpreter likes to hold on to memory even when the code it is running (Pulp) is no longer using it. Improvements were made in Python 3 (.3 I think?), but it's still not perfect.

As you point out, the only way to guarantee that a task returns all its memory is to start a new process for it. We could consider making celery do that for some or all task types.

Of course prevention is also hugely valuable, so we will continue to identify and fix specific areas of code that cause a spike in memory use.

I suggest converting this issue into a Story or Refactor that asks for each pulp task to be run in a one-time-use process. We can continue tracking specific memory use problems as separate bugs.

#2 Updated by mhrivnak over 3 years ago

Celery has a setting to replace worker processes after some number of tasks, which could be a simple and effective approach.

http://docs.celeryproject.org/en/latest/userguide/workers.html#max-tasks-per-child-setting

#3 Updated by amacdona@redhat.com over 3 years ago

  • Tracker changed from Issue to Task
  • Subject changed from Reduce memory leaks to Investigate memory usage and optimization

#4 Updated by bmbouter over 3 years ago

I agree we should investigate that worker option. A lot of people use it for this reason. It will cause a lot more connection churn to mongo and qpid as Pulp runs, but that is probably ok as long as the count-before-restart is not too low.

If we do go to enable it, we should set the task count to not less than 2. Almost all Pulp tasks are really two celery tasks, the task itself (task A) and a task run just after to release the reservation for the task A.

It's possible that the connection to Qpid could be recycled when this restart occurs. If so we will lose all tasks (issue #489) in the worker's dedicated queue with each restart. This will be noticed immediately since release_resource tasks will likely be lost, which will cause tasks to hang quickly. FYI as something to be aware of.

#5 Updated by bmbouter over 3 years ago

I tested this with 1000 tasks and it worked very well. I was investigating a qpid bug and I wanted this regular forking behavior for testing. I can say that Pulp worked very well with this so I think it would be safe to enable.

One thing to be aware of is that the resource_manager also uses this setting, which I didn't expect, but I think that is ok. I spent some time trying to detect the resource manager at configuration time so the setting would not be enabled for it, but I did not see a way to do that without a lot of effort.

diff --git a/server/pulp/server/async/celery_instance.py b/server/pulp/server/async/celery_instance.py
index 2a80fd3..693cd46 100644
--- a/server/pulp/server/async/celery_instance.py
+++ b/server/pulp/server/async/celery_instance.py
@@ -46,6 +46,7 @@ celery.conf.update(CELERYBEAT_SCHEDULER='pulp.server.async.scheduler.Scheduler')
 celery.conf.update(CELERY_WORKER_DIRECT=True)
 celery.conf.update(CELERY_TASK_SERIALIZER='json')
 celery.conf.update(CELERY_ACCEPT_CONTENT=['json'])
+celery.conf.update(CELERYD_MAX_TASKS_PER_CHILD=2)

 def configure_login_method():

#6 Updated by mhrivnak about 3 years ago

This could be set on just the regular workers by using the "--maxtasksperchild" command line argument here:

https://github.com/pulp/pulp/blob/0ec80d17/server/pulp/server/async/manage_workers.py#L25

Celery can also load config values from environment variables, but you have to call a specific function with the name of the setting. We could do that, and then only set the environment variable for worker processes.

For long-term testing, I'd be inclined to set the value low just to exercise it and make sure any problems get noticed. For production use, I'd be more inclined to raise the value so the recycling doesn't cause a lot of resource churn. So making it configurable would be ideal.

Not wanting to rock the 2.y boat, I'd make this a fairly low-priority pulp 3 improvement that we could make after we get to a minimum-viable-product with postgres.

#7 Updated by bmbouter about 3 years ago

  • Sprint Candidate changed from No to Yes
  • Tags Pulp 3 added

mhrivnak wrote:

This could be set on just the regular workers by using the "--maxtasksperchild" command line argument here:

https://github.com/pulp/pulp/blob/0ec80d17/server/pulp/server/async/manage_workers.py#L25

+1 to doing it this way.

Enabling it with an argument sounds good since it would just apply to the pulp_workers. The link you gave is only for systemd, we also need to do it for upstart too so that would be here also: https://github.com/pulp/pulp/blob/d1735ec63109d19705b36e03c036521cd7126a19/server/etc/rc.d/init.d/pulp_workers

Celery can also load config values from environment variables, but you have to call a specific function with the name of the setting. We could do that, and then only set the environment variable for worker processes.

This probably won't resolve the issue because pulp_workers and pulp_resource_manager won't be able to conditionally switch the arguments they are passing to this special function. Doing it with arguments to celeryd is probably the best way.

For long-term testing, I'd be inclined to set the value low just to exercise it and make sure any problems get noticed. For production use, I'd be more inclined to raise the value so the recycling doesn't cause a lot of resource churn. So making it configurable would be ideal.

+1 to making it configurable with a low default

Not wanting to rock the 2.y boat, I'd make this a fairly low-priority pulp 3 improvement that we could make after we get to a minimum-viable-product with postgres.

+1 to making it on pulp 3 and not in 2.y.

I suggest doing this sooner in the pulp 3 development rather than later: (1) it won't be hard, and (2) we can all derive the testing benefit as pulp 3 is developed.

#8 Updated by bmbouter about 3 years ago

  • Checklist item Add --maxtasksperchild = 2 to pulp_workers systemd definition added
  • Checklist item Add --maxtasksperchild = 2 to pulp_workers upstart file definition added
  • Checklist item Add release note about change added
  • Checklist item Add configuration option which specifies the value and defaults to 2 added

#9 Updated by bmbouter about 3 years ago

I added some checklist items for this story. What else needs to be done to groom this?

#10 Updated by jokroepke about 3 years ago

Hi,

Pulp 3 looks so far away. Is there a chance to add this line provided by bmbouter to pulp 2.x, marked as experimental? It does not have to be well done.

#11 Updated by mhrivnak about 3 years ago

bmbouter wrote:

mhrivnak wrote:

This could be set on just the regular workers by using the "--maxtasksperchild" command line argument here:

https://github.com/pulp/pulp/blob/0ec80d17/server/pulp/server/async/manage_workers.py#L25

+1 to doing it this way.

Enabling it with an argument sounds good since it would just apply to the pulp_workers. The link you gave is only for systemd, we also need to do it for upstart too so that would be here also: https://github.com/pulp/pulp/blob/d1735ec63109d19705b36e03c036521cd7126a19/server/etc/rc.d/init.d/pulp_workers

Not if we wait for pulp 3 to do it. :)

Celery can also load config values from environment variables, but you have to call a specific function with the name of the setting. We could do that, and then only set the environment variable for worker processes.

This probably won't resolve the issue because pulp_workers and pulp_resource_manager won't be able to conditionally switch the arguments they are passing to this special function. Doing it with arguments to celeryd is probably the best way.

They could all load the setting, but if we define it in /etc/default/pulp_workers, then only the worker processes will see the value.

#12 Updated by mhrivnak about 3 years ago

jokroepke wrote:

Hi,

Pulp 3 looks so far away. Is there a chance to add this line provided by bmbouter to pulp 2.x, marked as experimental? It does not have to be well done.

The best way to make that happen would be to have a pull request submitted from the community. As long as the default behavior doesn't change, I think we could accept this onto a 2.y.

If it were sufficient to only work with systemd, the environment variable route would probably be easiest to implement. We could then consider moving it to pulp's own configuration (or not) in 3.0. It would be a 1-line addition here to load the setting from an env var:

https://github.com/pulp/pulp/blob/master/server/pulp/server/async/celery_instance.py#L49

and then add one line plus a comment in /etc/default/pulp_workers
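
A minimal sketch of what the environment variable route could look like, under a few assumptions: the variable name PULP_MAX_TASKS_PER_CHILD is borrowed from the revisions attached to this story, the setting name CELERYD_MAX_TASKS_PER_CHILD comes from the diff in note 5, and `max_tasks_overrides` is a hypothetical helper rather than anything in Pulp or Celery. The dictionary it returns would be handed to `celery.conf.update(...)` the same way the diff does.

```python
import os


def max_tasks_overrides(environ=None):
    """Return Celery setting overrides derived from PULP_MAX_TASKS_PER_CHILD.

    An unset or empty variable means "no recycling", so the default
    behavior stays unchanged.
    """
    if environ is None:
        environ = os.environ
    raw = environ.get("PULP_MAX_TASKS_PER_CHILD", "").strip()
    if not raw:
        return {}
    value = int(raw)
    if value < 1:
        raise ValueError("PULP_MAX_TASKS_PER_CHILD must be a positive integer")
    return {"CELERYD_MAX_TASKS_PER_CHILD": value}


# Assumed usage inside celery_instance.py:
#   celery.conf.update(**max_tasks_overrides())
```

Because only /etc/default/pulp_workers would define the variable, the resource manager and celerybeat would get an empty dictionary and keep their current behavior.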

#13 Updated by bmbouter about 3 years ago

The environment variable approach seems to do more than we want. Adding a hard coded value to the systemd definition would be easy and that file already allows for user configuration here: https://github.com/pulp/pulp/blob/master/server/etc/default/systemd_pulp_workers

If the feature is off by default in 2.y, we could easily add it there or pull the value in from server.conf.

#15 Updated by mhrivnak about 3 years ago

bmbouter wrote:

The environment variable approach seems to do more than we want.

Can you elaborate on what you have in mind? I'm just not following.

It seems like both options involve setting an environment variable in /etc/default/pulp_workers. Then it's a question of: do we use that value to add an argument to the command line when starting the worker, or do we let celery read the value during app initialization. Either approach is very simple and likely a 1 or 2 line change.

#16 Updated by bmbouter about 3 years ago

Using environment variables uses a layer of indirection that isn't necessary. It would set the variable and then code comes along later and causes a behavior based on that. Instead it would be more direct to have the thing starting the worker init script and worker systemd unit file specify the argument directly. It also has the tertiary benefit of being able to see if the behavior is present or not using `ps -awfux`.

I posted my outline for how to fix the PR here: https://github.com/pulp/pulp/pull/2723#issuecomment-243904063
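
For comparison, the direct approach could be sketched roughly like this. It is not the code from the PR: `worker_command` is a hypothetical helper, and the base arguments (`-A pulp.server.async.app`, `-c 1`, the worker name) are assumptions about how a Pulp worker is started. The point is only that the flag is appended for workers when a value is configured, so it shows up in the process listing.

```python
def worker_command(name, max_tasks_per_child=None):
    """Build the argument list used to start one Celery worker."""
    # Base command line; the exact arguments are assumed for illustration.
    command = ["celery", "worker", "-A", "pulp.server.async.app",
               "-c", "1", "-n", name]
    if max_tasks_per_child is not None:
        # Only processes started with a configured value get recycling.
        command.append("--maxtasksperchild=%d" % max_tasks_per_child)
    return command


# A regular worker with recycling enabled, and a resource manager without it.
worker = worker_command("reserved_resource_worker-0@%h", max_tasks_per_child=2)
manager = worker_command("resource_manager@%h")
```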

#17 Updated by bmbouter about 3 years ago

  • Checklist item deleted (Add configuration option which specifies the value and defaults to 2)
  • Checklist item Add --maxtasksperchild = 2 to pulp_workers systemd definition set to Done
  • Checklist item Add --maxtasksperchild = 2 to pulp_workers upstart file definition set to Done
  • Status changed from NEW to POST
  • Assignee set to jokroepke
  • Platform Release set to 2.11.0
  • Tags deleted (Pulp 3)

#18 Updated by bmbouter about 3 years ago

  • Related to Story #2371: Use process recycling by default added

#19 Updated by bmbouter about 3 years ago

Another PR which completes this issue: https://github.com/pulp/pulp/pull/2803

#20 Updated by bmbouter about 3 years ago

  • Checklist item Add release note about change set to Done

#21 Updated by bmbouter about 3 years ago

QE, to test this you should do the following:

test1. verify the feature works on systemd:
1. stop all pulp workers
2. uncomment the PULP_MAX_TASKS_PER_CHILD setting so that the value is equal to 2
3. start pulp and do some basic sync regression testing
4. verify that the pulp workers (not resource_manager, not celerybeat) all have the --maxtasksperchild=2 argument showing on the process listing. I use `ps -awfux | grep celery` to show the processes.

test2. Verify the feature works on upstart
same thing as on systemd, except you restart the processes like you would on upstart

test3. verify that an upgrade on upstart does have the section containing PULP_MAX_TASKS_PER_CHILD in /etc/default/pulp_workers after upgrade.

test4. verify that an upgrade on systemd does have the section containing PULP_MAX_TASKS_PER_CHILD in /etc/default/pulp_workers after upgrade.

test5. test that a commented out value for PULP_MAX_TASKS_PER_CHILD causes upstart processes to not receive the --maxtasksperchild=2 argument on pulp_workers. Only pulp_workers, not pulp_resource_manager or pulp_celerybeat. They are not involved in this feature.

test6. same as test5, but on systemd
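
The process-listing checks in the tests above could be scripted along these lines. This is only a sketch: `recycling_problems` is a hypothetical helper, the sample lines are made up and simplified (real `ps -awfux` output has more columns), and it assumes pulp workers can be recognized by `reserved_resource_worker` in their name.

```python
def recycling_problems(ps_lines, expected=None):
    """Return the celery process lines whose recycling flag is wrong.

    expected is the configured PULP_MAX_TASKS_PER_CHILD value, or None when
    the setting is commented out. Only pulp workers may carry the flag.
    """
    problems = []
    for line in ps_lines:
        tokens = line.split()
        if not any("celery" in token for token in tokens):
            continue  # not a celery process
        flags = [t for t in tokens if t.startswith("--maxtasksperchild")]
        is_worker = any("reserved_resource_worker" in t for t in tokens)
        if is_worker and expected is not None:
            wanted = ["--maxtasksperchild=%d" % expected]
        else:
            wanted = []
        if flags != wanted:
            problems.append(line)
    return problems


# Made-up, simplified process listing for illustration.
SAMPLE = [
    "apache 101 celery worker -n reserved_resource_worker-0@host -c 1 --maxtasksperchild=2",
    "apache 102 celery worker -n resource_manager@host -c 1",
    "apache 103 celery beat",
]
```

With the sample above, `recycling_problems(SAMPLE, expected=2)` reports nothing, while `recycling_problems(SAMPLE)` flags the worker line, which corresponds to test5 and test6 failing.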

#22 Updated by jokroepke about 3 years ago

Thanks for completing this feature.

#24 Updated by bmbouter about 3 years ago

  • Status changed from POST to MODIFIED

#25 Updated by semyers about 3 years ago

  • Status changed from MODIFIED to ON_QA

#26 Updated by bmbouter about 3 years ago

  • Tracker changed from Task to Story
  • Subject changed from Investigate memory usage and optimization to Memory Improvements with Process Recycling

#28 Updated by pthomas@redhat.com almost 3 years ago

  • Status changed from ON_QA to VERIFIED

Verified

Followed the tests from https://pulp.plan.io/issues/2172#note-21

#30 Updated by pcreech almost 3 years ago

  • Status changed from VERIFIED to CLOSED - CURRENTRELEASE

#31 Updated by pulpbot almost 3 years ago

  • Verified changed from No to Yes

#32 Updated by bmbouter 7 months ago

  • Tags Pulp 2 added
