Project

Profile

Help

Issue #2492

Small memory leak in httpd when dispatching Pulp tasks

Added by bmbouter almost 3 years ago. Updated 6 months ago.

Status:
NEW
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Severity:
1. Low
Version:
Platform Release:
Blocks Release:
OS:
Backwards Incompatible:
No
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
QA Contact:
Complexity:
Smash Test:
Verified:
No
Verification Required:
No
Sprint:

Description

standalone pulp reproducer:

name="some-very-long-repo-name-some-very-long-repo-name-some-very-long-repo-name-test" 

i=1;
while true; do
  pulp-admin rpm repo create --repo-id=${name}-${i} --display-name=${name}-${i}
  pulp-admin rpm repo delete --repo-id=${name}-${i} 
  sleep 2
done

(plus some ps monitoring)

Memory usage (RSS) grows approx 110bytes per one task /220bytes per one iteration of above cycle (sum of RSS increases per individual processes)

wsgi_memory_growth_10000.png (15.1 KB) bmbouter, 12/28/2016 06:51 PM wsgi_memory_growth_10000.png
uwsgi_10K_create_and_delete.png (8.87 KB) bmbouter, 01/11/2017 09:21 PM uwsgi_10K_create_and_delete.png
250
250

Related issues

Related to Pulp - Story #2632: As a developer I want to reevaluate worker issues to see if they have been resolved by moving from Celery3 to Celery4 CLOSED - WONTFIX Actions

History

#1 Updated by mhrivnak almost 3 years ago

We at least need to identify a root cause soon. If fixing is complex, we could consider a different priority for that.

#2 Updated by bizhang almost 3 years ago

  • Sprint/Milestone set to 31
  • Triaged changed from No to Yes

#3 Updated by bmbouter almost 3 years ago

  • Description updated (diff)
  • Sprint/Milestone deleted (31)

#4 Updated by bmbouter almost 3 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter

#5 Updated by bmbouter almost 3 years ago

  • File wsgi_memory_growth.png added

On a 2.10-dev installation of Pulp, I modified the httpd conf file to have exactly 1 WSGI process for Pulp which causes it to handle all requests.

I then ran this reproducer:

name="some-very-long-repo-name-some-very-long-repo-name-some-very-long-repo-name-test" 

for i in {0..1000}; do
  sudo ps -up $(pgrep -f 'wsgi:pulp\)')

  pulp-admin rpm repo create --repo-id=${name}-${i} --display-name=${name}-${i}
  pulp-admin rpm repo delete --repo-id=${name}-${i}
done

I then grab all of the process memory footprints into one text file with:

cat output.txt | grep FOREGROUND > filtered.txt
awk '{print $6}' ~/filtered.txt > ~/filtered_mem_only.txt

I then graph that using Google Docs and I get the attached figure.

#6 Updated by bmbouter almost 3 years ago

  • File deleted (wsgi_memory_growth.png)

#7 Updated by bmbouter almost 3 years ago

  • File wsgi_memory_growth.png added

I ran 3 additional test cases.

  • I reran the original reproducer, but using RabbitMQ instead of Qpid
  • using RabbitMQ, I reran the reproducer but only had it create the repos. This codepath does not involve Celery or the message bus.
  • using RabbitMQ, I reran the reproducer but only had it delete the repos. This codepath does involve Celery and the message bus.

All 4 test cases are in one figure for comparison. It includes the previous data also so I'm deleting the previous figure.

I think the data supports several conclusions:

1. The memory leak is not unique to Qpid
2. Some of the memory growth is not related to Celery, but after ~500 requests memory usage does not grow
3. The memory is related to Celery

I don't think we know if the memory growth will continue to grow monotonically with continued tasking system use, or if it will eventually stabilize.

#8 Updated by bmbouter almost 3 years ago

  • File wsgi_memory_growth_10000.png added

I ran the following 3 tests for 10K requests and attached the image.

  • create+delete using RabbitMQ
  • create only using RabbitMQ
  • delete only using RabbitMQ

btw, all of this data is recorded here0.

[0]: https://docs.google.com/a/redhat.com/spreadsheets/d/1N4Tm5uyblnHoBdy27n5_YohkfIQxkWzIr_kfZKF5n3Y/edit?usp=sharing

#9 Updated by bmbouter almost 3 years ago

  • File deleted (wsgi_memory_growth.png)

#10 Updated by bmbouter almost 3 years ago

  • File deleted (wsgi_memory_growth_10000.png)

#11 Updated by bmbouter almost 3 years ago

250

The man page states that the RSS column is in Kilobytes not bytes. Uploading a new figure with the correct axis label.

#12 Updated by bmbouter almost 3 years ago

From the long running tests, I think it shows a few things:

1. Celery memory usage for task dispatch contributes the most to the memory problems
2. Non-celery operations also increase memory usage but more slowly. httpd does seem to release this with a slight dip around 9K requests.
3. Memory growth seems to increase monotonically
4. The rate of worst-case growth (the red line) is approx. 0.2928 KB per create+destroy operation. That includes both the create and destroy operation.

I don't think this is severe enough to continue investigating. We should consider doing one of two things. We could make a change to the default WSGI config to enable process recycling, or we could close as WONTFIX.

What do others think?

#14 Updated by mhrivnak almost 3 years ago

I think process recycling is an appropriate change upstream.

The number of requests between restarts should be large so this doesn't have a big resource impact, especially since the memory leak is very small. Just making stuff up, I'd go with at least 10,000 requests between restarts. That would still keep the memory leak's impact very small.

I also thing the graceful-timeout should be large so there is ample opportunity for the process to do a graceful shutdown. There's no harm in letting that be at least several minutes.

http://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html

#15 Updated by dkliban@redhat.com almost 3 years ago

Is the memory leak present when httpd is handling requests for content? If so, then we should open up a bug against httpd.

If the memory leak is not present when httpd is handling requests for content, then we should consider how many Pulp API requests would need to be made in order to cause a significant increase in memory use. Based on the data provided in comment 12, it would take over 6800 API requests to cause httpd to use an extra 1 megabyte of memory. This is not a serious problem.

Introducing process recycling will only address the symptoms of the problem. It will also prevent us from noticing any other memory leaks that could be introduced in the future.

We should focus on fixing the problem. If we don't know how to fix it, we should at least leave it as is so someone else can attempt to fix it.

#16 Updated by bmbouter almost 3 years ago

After reading comments from others, and thinking more about this, I think we should leave this at NEW and remove it from the sprint. The main things that stand out in my mind are:

  • It allows for the potential early discovery of a more serious leak (memory, fd's, AIO contexts, whatever)
  • This leak is not severe so the benefit of making this change is very small.
  • It allows the issue to remain open for someone to really fix it with a tool like guppy

#17 Updated by mhrivnak almost 3 years ago

  • Sprint/Milestone set to 31

#18 Updated by bmbouter almost 3 years ago

  • File uwsgi_2K_create_and_delete.png added
  • Sprint/Milestone deleted (31)

I wanted to compare the memory behavior of httpd versus other webservers, so I ran the same reproducer with Pulp running against uWSGI and nginx. nginx used a fixed (literally 0 kb changes during the test) of 5292 KB. The uWSGI however leaked similarly to httpd. See the attached graph of the create+delete test using rabbitMQ with 2K requests.

#19 Updated by bmbouter almost 3 years ago

I added some to my reproducer with additional webserver support and the tracking of file descriptors. In terms of next steps, I plan to heavily comment out the codepaths of these and trying to observe what aspects of the request make additional. This was originally an idea from @mhrivnak.

name="some-very-long-repo-name-some-very-long-repo-name-some-very-long-repo-name-test" 

rm httpd_mem_data.txt
rm httpd_fd_data.txt

rm uwsgi_mem_data.txt
rm uwsgi_fd_data.txt

rm nginx_mem_data.txt
rm nginx_fd_data.txt

rm -rf ~/devel/mem_output/*
sudo rm /var/lib/pulp/count_output.txt

for i in {0..2000}; do
  #sudo ps --no-header -up $(pgrep -f 'wsgi:pulp\)') | awk '{ print $6 }' >> httpd_mem_data.txt
  #sudo lsof -p $(pgrep -f 'wsgi:pulp\)') | wc -l >> httpd_fd_data.txt

  sudo ps --no-header -up $(pgrep -f 'uwsgi pulp_uwsgi.ini' | tail -n1) | awk '{ print $6 }' >> uwsgi_mem_data.txt
  sudo lsof -p $(pgrep -f 'uwsgi pulp_uwsgi.ini' | tail -n1) | wc -l >> uwsgi_fd_data.txt

  sudo ps --no-header -up $(pgrep -f 'nginx: worker process') | awk '{ print $6 }' >> nginx_mem_data.txt
  sudo lsof -p $(pgrep -f 'nginx: worker process') | wc -l >> nginx_fd_data.txt

  pulp-admin rpm repo create --repo-id=${name}-${i} --display-name=${name}-${i}
  pulp-admin rpm repo delete --repo-id=${name}-${i}
done

#20 Updated by bmbouter almost 3 years ago

  • File deleted (uwsgi_2K_create_and_delete.png)

#21 Updated by bmbouter almost 3 years ago

250

Adding a graph containing uwsgi memory usage for 10K create+delete operations.

#22 Updated by bmbouter over 2 years ago

After commenting out several areas of codepaths fired by the reproducer, the following diff causes memory info to be constant even with Pulp doing a bunch of meaningful work. Specifically the patch disables authorization and the response forming and middleware. This patch does cause meaningful work to occur including the repo being created and deleted, full logging, celery task dispatch for the repo deletion, and mongodb usage.

diff --git a/server/pulp/server/webservices/views/repositories.py b/server/pulp/server/webservices/views/repositories.py
index e0389c8..c9e500d 100644
--- a/server/pulp/server/webservices/views/repositories.py
+++ b/server/pulp/server/webservices/views/repositories.py
@@ -103,7 +103,7 @@ class ReposView(View):
                                          include_distributors)
         return generate_json_response_with_pulp_encoder(processed_repos)

-    @auth_required(authorization.CREATE)
+    #@auth_required(authorization.CREATE)
     @parse_json_body(json_type=dict)
     def post(self, request):
         """ 
@@ -130,6 +130,7 @@ class ReposView(View):

         repo = serializers.Repository(repo_obj).data
         response = generate_json_response_with_pulp_encoder(repo)
+        return
         return generate_redirect_response(response, repo['_href'])

@@ -170,7 +171,7 @@ class RepoResourceView(View):

         return generate_json_response_with_pulp_encoder(repo)

-    @auth_required(authorization.DELETE)
+    #@auth_required(authorization.DELETE)
     def delete(self, request, repo_id):
         """ 
         Dispatch a task to delete a repository.
@@ -186,6 +187,7 @@ class RepoResourceView(View):
         """ 
         model.Repository.objects.get_repo_or_missing_resource(repo_id)
         async_result = repo_controller.queue_delete(repo_id)
+        return
         raise exceptions.OperationPostponed(async_result)

     @auth_required(authorization.UPDATE)

#23 Updated by bizhang over 2 years ago

  • Related to Story #2632: As a developer I want to reevaluate worker issues to see if they have been resolved by moving from Celery3 to Celery4 added

#24 Updated by bmbouter over 1 year ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (bmbouter)

I'm not actively working on this so I'm moving back to NEW and unassigned.

This bug won't affect Pulp3 because the code containing the issue is not included in Pulp3 (and never will be). We know which code contained the issue based on the investigation done in this bug.

#25 Updated by bmbouter 6 months ago

  • Tags Pulp 2 added

Please register to edit this issue

Also available in: Atom PDF