Issue #2582

closed

Pipe leak in any parent Pulp celery process triggered by qpidd restart

Added by bmbouter almost 8 years ago. Updated almost 6 years ago.

Status: CLOSED - CURRENTRELEASE
Priority: High
Assignee: -
Category: -
Sprint/Milestone: -
Start date:
Due date:
Estimated time:
Severity: 3. High
Version: 2.8.7
Platform Release:
OS:
Triaged: Yes
Groomed: No
Sprint Candidate: No
Tags: Pulp 2
Sprint:
Quarter:

Description

Filed as reported by @pmoravec.

Whenever I restart qpidd, I see that every parent Pulp celery process (i.e. the resource manager and every worker) allocates four extra file descriptors: the read and write ends of two new pipes.

After more than 250 restarts, the affected process runs out of FDs and shuts down.

So, if one restarts the qpidd broker 250 times (admittedly an improbable scenario), Pulp ends up with no worker and no manager process, and is unable to perform any task.
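
The arithmetic behind the "about 250 restarts" figure can be sketched as follows. The soft limit of 1024 open files is an assumption (a common Linux default, not stated in this report), and the baseline of 30 descriptors already open is a hypothetical value; only the 4 descriptors leaked per restart comes from the observations in this issue.

```shell
# Estimate how many qpidd restarts a process survives before it hits its FD limit.
# Arguments: soft FD limit, descriptors already open, descriptors leaked per restart.
restarts_until_exhaustion() {
    limit=$1
    in_use=$2
    per_restart=$3
    # Integer division: a partially "paid for" restart does not count.
    echo $(( (limit - in_use) / per_restart ))
}

# Assumed limit 1024, hypothetical baseline of 30 open FDs, 4 FDs leaked per restart.
restarts_until_exhaustion 1024 30 4   # prints 248
```

On a live system the real inputs could be taken from `ulimit -n` (run as the service user) and from `ls /proc/PID/fd | wc -l` for the parent celery PID.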

Lazy reproducer

for i in $(seq 1 250); do service qpidd restart; sleep 15; done

Then check whether any Pulp worker or manager process is still up.
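
One possible way to do that check is to count the celery worker lines in the process list. This is only a sketch: the helper name count_celery_workers is made up for illustration, and the pattern assumes the command lines look like the ps output shown in the detailed reproducer.

```shell
# Count celery worker processes in `ps` output read from stdin.
# The [c] bracket keeps a grep command line from matching its own pattern.
count_celery_workers() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true" hides that.
    grep -c '[c]elery worker' || true
}

# Intended use on the Pulp host (not run here):
#   ps aux | count_celery_workers
```

A result of 0 after the restart loop would indicate that neither the resource manager nor any worker survived.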

Detailed reproducer

1. Check PIDs of parent resource_manager and parent worker-* processes:

# ps aux | grep celery
apache   24202  0.1  0.5 669132 63500 ?        Ssl  14:59   0:02 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
apache   24235  0.1  0.5 669792 63712 ?        Ssl  14:59   0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30 --maxtasksperchild=2
apache   24238  0.1  0.5 669116 63612 ?        Ssl  14:59   0:02 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30 --maxtasksperchild=2
apache   24273  0.1  0.2 661948 33012 ?        Ssl  14:59   0:03 /usr/bin/python /usr/bin/celery beat --app=pulp.server.async.celery_instance.celery --scheduler=pulp.server.async.scheduler.Scheduler
apache   24306  0.0  0.4 668396 55976 ?        Sl   14:59   0:00 /usr/bin/python /usr/bin/celery worker -A pulp.server.async.app -n resource_manager@%h -Q resource_manager -c 1 --events --umask 18 --pidfile=/var/run/pulp/resource_manager.pid --heartbeat-interval=30
apache   25315  0.0  0.4 669792 56860 ?        Sl   15:19   0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30 --maxtasksperchild=2
apache   25811  0.0  0.4 669116 54472 ?        S    15:29   0:00 /usr/bin/python /usr/bin/celery worker -n reserved_resource_worker-1@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-1.pid --heartbeat-interval=30 --maxtasksperchild=2
root     26213  0.0  0.0 112652   960 pts/0    S+   15:36   0:00 grep --color=auto celery
#

2. Take lsof output of these processes:
for i in 24202 24235 24238; do lsof -p $i | sort > lsof.${i}.1; done

3. Restart qpidd:
service qpidd restart

4. Take lsof output again (to new files):
for i in 24202 24235 24238; do lsof -p $i | sort > lsof.${i}.2; done

5. Compare the lsof outputs:

# wc lsof*
  169 1525 18058 lsof.24202.1
  173 1561 18378 lsof.24202.2
  169 1525 18058 lsof.24235.1
  173 1561 18378 lsof.24235.2
  169 1525 18058 lsof.24238.1
  173 1561 18378 lsof.24238.2
#

Diff shows extra:

celery 24202 apache 33r FIFO 0,8 0t0 15901973 pipe
celery 24202 apache 34w FIFO 0,8 0t0 15901973 pipe
celery 24202 apache 35r FIFO 0,8 0t0 15901975 pipe
celery 24202 apache 36w FIFO 0,8 0t0 15901975 pipe

6. Check /proc:

# file /proc/24202/fd/33 /proc/24202/fd/34 /proc/24202/fd/35 /proc/24202/fd/36
/proc/24202/fd/33: broken symbolic link to `pipe:[15901973]'
/proc/24202/fd/34: broken symbolic link to `pipe:[15901973]'
/proc/24202/fd/35: broken symbolic link to `pipe:[15901975]'
/proc/24202/fd/36: broken symbolic link to `pipe:[15901975]'
#

7. Go back to step 3 and repeat.
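
The comparison in step 5 could also be automated by counting the FIFO entries in each saved lsof file. This is a rough sketch: the helper names count_pipes and pipe_growth are invented for illustration, and it assumes the default lsof column layout seen above, where the fifth field is the TYPE column.

```shell
# Count pipe (FIFO) descriptors in a saved lsof listing.
count_pipes() {
    # Field 5 is TYPE in the default lsof layout; "n + 0" prints 0 when none match.
    awk '$5 == "FIFO" { n++ } END { print n + 0 }' "$1"
}

# Print how many more pipe descriptors the second listing has than the first.
pipe_growth() {
    before=$(count_pipes "$1")
    after=$(count_pipes "$2")
    echo $(( after - before ))
}

# Intended use with the files from steps 2 and 4 (not run here):
#   pipe_growth lsof.24202.1 lsof.24202.2
```

With the listings from this report, a growth of 4 per qpidd restart would match the extra lines shown in step 5.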


Related issues

Related to Pulp - Story #2632: As a developer I want to reevaluate worker issues to see if they have been resolved by moving from Celery3 to Celery4 (CLOSED - WONTFIX)

Actions #1

Updated by bizhang almost 8 years ago

  • Priority changed from Normal to High
  • Sprint/Milestone set to 33
  • Severity changed from 2. Medium to 3. High
Actions #2

Updated by bizhang almost 8 years ago

  • Triaged changed from No to Yes
Actions #4

Updated by bizhang almost 8 years ago

  • Status changed from NEW to ASSIGNED
Actions #5

Updated by bizhang almost 8 years ago

  • Assignee set to bizhang
Actions #6

Updated by mhrivnak almost 8 years ago

  • Sprint/Milestone changed from 33 to 34
Actions #7

Updated by bmbouter almost 8 years ago

Since we're moving away from Celery 3.1 and onto Celery 4.0, I think we should defer this task. There are several bugs filed (maybe 4+) about memory leaks, high CPU usage, and file descriptor leaks in the Pulp workers. We should take all of them and make one big testing story to evaluate correctness in all of those aspects with Celery 4.0 and qpidd. I think that work would make the most sense as part of Pulp3 (as in, add the pulp3 tag).

Actions #8

Updated by bizhang almost 8 years ago

  • Related to Story #2632: As a developer I want to reevaluate worker issues to see if they have been resolved by moving from Celery3 to Celery4 added
Actions #9

Updated by bizhang almost 8 years ago

I just retested with the rawhide packages, and the file descriptor leak was not present:

[root@dev ~]# for i in 18676 18682 18725 18729; do lsof -p $i | sort > lsof.${i}.1; done
[root@dev ~]# for i in $(seq 1 10); do service qpidd restart; sleep 15; done
[root@dev ~]# for i in 18676 18682 18725 18729; do lsof -p $i | sort > lsof.${i}.2; done
[root@dev ~]# wc lsof.*
   173   1557  17742 lsof.18676.1
   173   1557  17742 lsof.18676.2
   173   1557  17742 lsof.18682.1
   173   1557  17742 lsof.18682.2
   173   1557  17742 lsof.18725.1
   173   1557  17742 lsof.18725.2
   173   1557  17742 lsof.18729.1
   173   1557  17742 lsof.18729.2
python2-celery.noarch                    4.0.2-2.fc26                    @rawhide
python2-kombu.noarch                     1:4.0.2-4.fc26                  @rawhide
python-qpid.noarch                       1.35.0-3.fc26                   @rawhide
Actions #10

Updated by bizhang almost 8 years ago

  • Status changed from ASSIGNED to CLOSED - WORKSFORME
Actions #11

Updated by bizhang almost 8 years ago

  • Status changed from CLOSED - WORKSFORME to NEW
Actions #12

Updated by mhrivnak almost 8 years ago

  • Assignee deleted (bizhang)
  • Sprint/Milestone deleted (34)

We decided to wait on this issue and let the upgrade to Celery 4 fix it for us. We will try to reproduce it on the new Celery 4 stack, but we expect it to be resolved.

See story https://pulp.plan.io/issues/2632

Actions #13

Updated by dkliban@redhat.com almost 6 years ago

  • Status changed from NEW to CLOSED - CURRENTRELEASE
Actions #14

Updated by bmbouter almost 6 years ago

  • Tags Pulp 2 added
