Issue #2492

Small memory leak in httpd when dispatching Pulp tasks

Added by bmbouter over 7 years ago. Updated over 3 years ago.

Status:
CLOSED - WONTFIX
Priority:
Normal
Assignee:
-
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
1. Low
Version:
Platform Release:
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Quarter:

Description

standalone pulp reproducer:

name="some-very-long-repo-name-some-very-long-repo-name-some-very-long-repo-name-test"

i=1;
while true; do
  pulp-admin rpm repo create --repo-id=${name}-${i} --display-name=${name}-${i}
  pulp-admin rpm repo delete --repo-id=${name}-${i} 
  sleep 2
done

(plus some ps monitoring)

Memory usage (RSS) grows by approximately 110 bytes per task, or about 220 bytes per iteration of the above loop (summing the RSS increases across the individual processes).
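A minimal sketch of what the ps monitoring could look like, reusing the same ps/pgrep pattern that appears later in this issue; the output file name is hypothetical, and the line would sit inside the while loop above alongside the create/delete calls:

# Record the RSS column (in KB, per the ps man page) of the Pulp WSGI
# process(es) once per iteration so the growth can be graphed afterwards.
sudo ps --no-header -up $(pgrep -f 'wsgi:pulp\)') | awk '{ print $6 }' >> wsgi_rss.txt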


Related issues

Related to Pulp - Story #2632: As a developer I want to reevaluate worker issues to see if they have been resolved by moving from Celery3 to Celery4 (CLOSED - WONTFIX)

Actions #1

Updated by mhrivnak over 7 years ago

We at least need to identify a root cause soon. If fixing is complex, we could consider a different priority for that.

Actions #2

Updated by bizhang over 7 years ago

  • Sprint/Milestone set to 31
  • Triaged changed from No to Yes
Actions #3

Updated by bmbouter over 7 years ago

  • Description updated (diff)
  • Sprint/Milestone deleted (31)
Actions #4

Updated by bmbouter over 7 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
Actions #5

Updated by bmbouter over 7 years ago

  • File wsgi_memory_growth.png added

On a 2.10-dev installation of Pulp, I modified the httpd conf file to use exactly 1 WSGI process for Pulp, which forces that single process to handle all requests.
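For reference, the single-process setup might look roughly like this WSGIDaemonProcess line (a sketch only; the thread count and display-name are illustrative assumptions, not Pulp's shipped defaults):

# Sketch only: force all Pulp API requests through a single WSGI daemon process.
WSGIDaemonProcess pulp processes=1 threads=15 display-name=%{GROUP}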

I then ran this reproducer:

name="some-very-long-repo-name-some-very-long-repo-name-some-very-long-repo-name-test"

for i in {0..1000}; do
  sudo ps -up $(pgrep -f 'wsgi:pulp\)') >> output.txt  # capture per-process memory for later filtering

  pulp-admin rpm repo create --repo-id=${name}-${i} --display-name=${name}-${i}
  pulp-admin rpm repo delete --repo-id=${name}-${i}
done

I then grab all of the process memory footprints into one text file with:

cat output.txt | grep FOREGROUND > filtered.txt
awk '{print $6}' filtered.txt > filtered_mem_only.txt

I then graph that using Google Docs and I get the attached figure.
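(For anyone reproducing this without Google Docs: a gnuplot one-liner along these lines could plot the same filtered_mem_only.txt file; the output file name and plot dimensions are arbitrary.)

# Hypothetical alternative plot of the per-iteration RSS values (KB) with gnuplot.
gnuplot -e "set terminal png size 800,400; set output 'mem_growth.png'; \
  set xlabel 'iteration'; set ylabel 'RSS (KB)'; \
  plot 'filtered_mem_only.txt' with lines title 'httpd WSGI RSS'"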

Actions #6

Updated by bmbouter over 7 years ago

  • File deleted (wsgi_memory_growth.png)
Actions #7

Updated by bmbouter over 7 years ago

  • File wsgi_memory_growth.png added

I ran 3 additional test cases.

  • I reran the original reproducer, but using RabbitMQ instead of Qpid
  • using RabbitMQ, I reran the reproducer but only had it create the repos. This codepath does not involve Celery or the message bus.
  • using RabbitMQ, I reran the reproducer but only had it delete the repos. This codepath does involve Celery and the message bus.

All 4 test cases are shown in one figure for comparison. It also includes the previous data, so I'm deleting the earlier figure.

I think the data supports several conclusions:

1. The memory leak is not unique to Qpid
2. Some of the memory growth is not related to Celery, but that growth levels off after ~500 requests
3. The continuing memory growth is related to Celery

I don't think we know whether memory usage will continue to grow monotonically with continued tasking system use, or whether it will eventually stabilize.

Actions #8

Updated by bmbouter over 7 years ago

  • File wsgi_memory_growth_10000.png added

I ran the following 3 tests for 10K requests and attached the image.

  • create+delete using RabbitMQ
  • create only using RabbitMQ
  • delete only using RabbitMQ

btw, all of this data is recorded here[0].

[0]: https://docs.google.com/a/redhat.com/spreadsheets/d/1N4Tm5uyblnHoBdy27n5_YohkfIQxkWzIr_kfZKF5n3Y/edit?usp=sharing

Actions #9

Updated by bmbouter over 7 years ago

  • File deleted (wsgi_memory_growth.png)
Actions #10

Updated by bmbouter over 7 years ago

  • File deleted (wsgi_memory_growth_10000.png)
Actions #11

Updated by bmbouter over 7 years ago

The man page states that the RSS column is in kilobytes, not bytes. I'm uploading a new figure with the correct axis label.

Actions #12

Updated by bmbouter over 7 years ago

From the long-running tests, I think the data shows a few things:

1. Celery memory usage for task dispatch contributes the most to the memory problems
2. Non-Celery operations also increase memory usage, but more slowly. httpd does seem to release some of this memory, with a slight dip around 9K requests.
3. Memory usage seems to increase monotonically
4. The rate of worst-case growth (the red line) is approximately 0.2928 KB per create+delete iteration, which covers both the create and the delete request.

I don't think this is severe enough to continue investigating. We should consider one of two things: change the default WSGI config to enable process recycling, or close this as WONTFIX.

What do others think?

Actions #14

Updated by mhrivnak over 7 years ago

I think process recycling is an appropriate change upstream.

The number of requests between restarts should be large so this doesn't have a big resource impact, especially since the memory leak is very small. Just making stuff up, I'd go with at least 10,000 requests between restarts. That would still keep the memory leak's impact very small.

I also think the graceful-timeout should be large so there is ample opportunity for the process to do a graceful shutdown. There's no harm in letting that be at least several minutes.

http://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
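For illustration, a recycling configuration along those lines might look roughly like the sketch below. The daemon-group name and the process/thread counts are assumptions rather than Pulp's shipped defaults; maximum-requests and graceful-timeout are the WSGIDaemonProcess options documented at the link above.

# Sketch only: restart each daemon process after 10,000 requests and allow up
# to 5 minutes for in-flight requests to finish before the restart happens.
WSGIDaemonProcess pulp processes=3 threads=15 display-name=%{GROUP} \
    maximum-requests=10000 graceful-timeout=300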

Actions #15

Updated by dkliban@redhat.com over 7 years ago

Is the memory leak present when httpd is handling requests for content? If so, then we should open up a bug against httpd.

If the memory leak is not present when httpd is handling requests for content, then we should consider how many Pulp API requests would need to be made in order to cause a significant increase in memory use. Based on the data provided in comment 12, it would take over 6800 API requests to cause httpd to use an extra 1 megabyte of memory (0.2928 KB per create+delete iteration covers two API requests, i.e. about 0.146 KB per request, so roughly 1024 / 0.146 ≈ 7000 requests per megabyte). This is not a serious problem.

Introducing process recycling will only address the symptoms of the problem. It will also prevent us from noticing any other memory leaks that could be introduced in the future.

We should focus on fixing the problem. If we don't know how to fix it, we should at least leave it as is so someone else can attempt to fix it.

Actions #16

Updated by bmbouter over 7 years ago

After reading comments from others, and thinking more about this, I think we should leave this at NEW and remove it from the sprint. The main things that stand out in my mind are:

  • Leaving process recycling off allows for the potential early discovery of a more serious leak (memory, fds, AIO contexts, whatever)
  • This leak is not severe, so the benefit of making the recycling change is very small.
  • Leaving the issue open lets someone really fix it later with a tool like guppy
Actions #17

Updated by mhrivnak over 7 years ago

  • Sprint/Milestone set to 31
Actions #18

Updated by bmbouter over 7 years ago

  • File uwsgi_2K_create_and_delete.png added
  • Sprint/Milestone deleted (31)

I wanted to compare the memory behavior of httpd against other webservers, so I ran the same reproducer with Pulp served by uWSGI and nginx. nginx used a fixed amount of memory, 5292 KB (literally 0 KB of change during the test). uWSGI, however, leaked similarly to httpd. See the attached graph of the create+delete test using RabbitMQ with 2K requests.

Actions #19

Updated by bmbouter over 7 years ago

I extended my reproducer to support additional webservers and to track file descriptors. In terms of next steps, I plan to heavily comment out the codepaths these requests exercise and observe which parts of the request handling contribute the additional memory growth. This was originally an idea from mhrivnak.

name="some-very-long-repo-name-some-very-long-repo-name-some-very-long-repo-name-test"

rm httpd_mem_data.txt
rm httpd_fd_data.txt

rm uwsgi_mem_data.txt
rm uwsgi_fd_data.txt

rm nginx_mem_data.txt
rm nginx_fd_data.txt

rm -rf ~/devel/mem_output/*
sudo rm /var/lib/pulp/count_output.txt

for i in {0..2000}; do
  #sudo ps --no-header -up $(pgrep -f 'wsgi:pulp\)') | awk '{ print $6 }' >> httpd_mem_data.txt
  #sudo lsof -p $(pgrep -f 'wsgi:pulp\)') | wc -l >> httpd_fd_data.txt

  sudo ps --no-header -up $(pgrep -f 'uwsgi pulp_uwsgi.ini' | tail -n1) | awk '{ print $6 }' >> uwsgi_mem_data.txt
  sudo lsof -p $(pgrep -f 'uwsgi pulp_uwsgi.ini' | tail -n1) | wc -l >> uwsgi_fd_data.txt

  sudo ps --no-header -up $(pgrep -f 'nginx: worker process') | awk '{ print $6 }' >> nginx_mem_data.txt
  sudo lsof -p $(pgrep -f 'nginx: worker process') | wc -l >> nginx_fd_data.txt

  pulp-admin rpm repo create --repo-id=${name}-${i} --display-name=${name}-${i}
  pulp-admin rpm repo delete --repo-id=${name}-${i}
done
Actions #20

Updated by bmbouter over 7 years ago

  • File deleted (uwsgi_2K_create_and_delete.png)
Actions #21

Updated by bmbouter over 7 years ago

Adding a graph containing uwsgi memory usage for 10K create+delete operations.

Actions #22

Updated by bmbouter about 7 years ago

After commenting out several of the codepaths exercised by the reproducer, the following diff keeps memory usage constant even while Pulp does a bunch of meaningful work. Specifically, the patch disables authorization, response forming, and the response middleware. With this patch applied, meaningful work still occurs, including the repo being created and deleted, full logging, Celery task dispatch for the repo deletion, and mongodb usage.

diff --git a/server/pulp/server/webservices/views/repositories.py b/server/pulp/server/webservices/views/repositories.py
index e0389c8..c9e500d 100644
--- a/server/pulp/server/webservices/views/repositories.py
+++ b/server/pulp/server/webservices/views/repositories.py
@@ -103,7 +103,7 @@ class ReposView(View):
                                          include_distributors)
         return generate_json_response_with_pulp_encoder(processed_repos)

-    @auth_required(authorization.CREATE)
+    #@auth_required(authorization.CREATE)
     @parse_json_body(json_type=dict)
     def post(self, request):
         """
@@ -130,6 +130,7 @@ class ReposView(View):

         repo = serializers.Repository(repo_obj).data
         response = generate_json_response_with_pulp_encoder(repo)
+        return
         return generate_redirect_response(response, repo['_href'])


@@ -170,7 +171,7 @@ class RepoResourceView(View):

         return generate_json_response_with_pulp_encoder(repo)

-    @auth_required(authorization.DELETE)
+    #@auth_required(authorization.DELETE)
     def delete(self, request, repo_id):
         """
         Dispatch a task to delete a repository.
@@ -186,6 +187,7 @@ class RepoResourceView(View):
         """
         model.Repository.objects.get_repo_or_missing_resource(repo_id)
         async_result = repo_controller.queue_delete(repo_id)
+        return
         raise exceptions.OperationPostponed(async_result)

     @auth_required(authorization.UPDATE)
Actions #23

Updated by bizhang about 7 years ago

  • Related to Story #2632: As a developer I want to reevaluate worker issues to see if they have been resolved by moving from Celery3 to Celery4 added
Actions #24

Updated by bmbouter about 6 years ago

  • Status changed from ASSIGNED to NEW
  • Assignee deleted (bmbouter)

I'm not actively working on this, so I'm moving it back to NEW and unassigning it.

This bug won't affect Pulp3 because the code containing the issue is not included in Pulp3 (and never will be). We know which code contained the issue based on the investigation done in this bug.

Actions #25

Updated by bmbouter about 5 years ago

  • Tags Pulp 2 added
Actions #26

Updated by dalley over 3 years ago

  • Status changed from NEW to CLOSED - WONTFIX

Shouldn't be a problem in Pulp 3
