Project

Profile

Help

Issue #1673

closed

Pulp's worker watcher does not notice workers that got killed by OOM killer and their tasks stay "running" forever

Added by rbarlow about 8 years ago. Updated about 5 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
3. High
Version:
Platform Release:
2.9.1
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 5
Quarter:

Description

In investing #1662 I had a worker get killed by the OOM killer. After this happened, the worker watcher did not become aware and so did not cancel the task that was now lost. Thus, the task stayed in a running state forever. Here is the log output of the worker getting killed by OOM killer:

Feb 12 19:14:10 dev pulp[20553]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[20f6a5df-babd-4d2c-a0b1-2df849dd9414]
Feb 12 19:14:10 dev pulp[20561]: celery.worker.strategy:INFO: Received task: pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd]
Feb 12 19:14:10 dev pulp[20561]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5]
Feb 12 19:14:10 dev pulp[20553]: celery.worker.job:INFO: Task pulp.server.async.tasks._queue_reserved_task[20f6a5df-babd-4d2c-a0b1-2df849dd9414] succeeded in 0.0379113480012s: None
Feb 12 19:14:36 dev kernel: celery invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 12 19:14:36 dev kernel: celery cpuset=/ mems_allowed=0
Feb 12 19:14:36 dev kernel: CPU: 0 PID: 20751 Comm: celery Not tainted 4.2.3-300.fc23.x86_64 #1
Feb 12 19:14:36 dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
Feb 12 19:14:36 dev kernel:  0000000000000000 000000004e0abf34 ffff88001a627948 ffffffff81771fca
Feb 12 19:14:36 dev kernel:  0000000000000000 ffff880075ba9c40 ffff88001a6279d8 ffffffff8177150e
Feb 12 19:14:36 dev kernel:  65a4d4fb1a6279a8 fffeefff00000000 0000000000000002 000000000209e203
Feb 12 19:14:36 dev kernel: Call Trace:
Feb 12 19:14:36 dev kernel:  [<ffffffff81771fca>] dump_stack+0x45/0x57
Feb 12 19:14:36 dev kernel:  [<ffffffff8177150e>] dump_header+0x86/0x207
Feb 12 19:14:36 dev kernel:  [<ffffffff811a627b>] oom_kill_process+0x1db/0x3a0
Feb 12 19:14:36 dev kernel:  [<ffffffff811a69ec>] out_of_memory+0x54c/0x5a0
Feb 12 19:14:36 dev kernel:  [<ffffffff811ac9d8>] __alloc_pages_nodemask+0x838/0x980
Feb 12 19:14:36 dev kernel:  [<ffffffff811f4f71>] alloc_pages_current+0x91/0x100
Feb 12 19:14:36 dev kernel:  [<ffffffff811a2a0b>] __page_cache_alloc+0xab/0xc0
Feb 12 19:14:36 dev kernel:  [<ffffffff811a4c34>] filemap_fault+0x154/0x410
Feb 12 19:14:36 dev kernel:  [<ffffffff811d10de>] __do_fault+0x4e/0xf0
Feb 12 19:14:36 dev kernel:  [<ffffffff811d6218>] handle_mm_fault+0xf58/0x17d0
Feb 12 19:14:36 dev kernel:  [<ffffffff81643efb>] ? sock_recvmsg+0x3b/0x50
Feb 12 19:14:36 dev kernel:  [<ffffffff81065447>] __do_page_fault+0x197/0x400
Feb 12 19:14:36 dev kernel:  [<ffffffff81065777>] trace_do_page_fault+0x47/0x100
Feb 12 19:14:36 dev kernel:  [<ffffffff8105fd2e>] do_async_page_fault+0x1e/0xc0
Feb 12 19:14:36 dev kernel:  [<ffffffff8177a968>] async_page_fault+0x28/0x30
Feb 12 19:14:36 dev kernel: Mem-Info:
Feb 12 19:14:36 dev kernel: active_anon:477508 inactive_anon:408 isolated_anon:0
                             active_file:34 inactive_file:0 isolated_file:17
                             unevictable:0 dirty:0 writeback:0 unstable:0
                             slab_reclaimable:6716 slab_unreclaimable:9582
                             mapped:353 shmem:479 pagetables:8316 bounce:0
                             free:3374 free_pcp:35 free_cma:0
Feb 12 19:14:36 dev kernel: Node 0 DMA free:7968kB min:40kB low:48kB high:60kB active_anon:6540kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB ml
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 1982 1982 1982
Feb 12 19:14:36 dev kernel: Node 0 DMA32 free:5528kB min:5580kB low:6972kB high:8368kB active_anon:1903492kB inactive_anon:1624kB active_file:136kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 0 0 0
Feb 12 19:14:36 dev kernel: Node 0 DMA: 14*4kB (UEM) 7*8kB (UEM) 29*16kB (E) 21*32kB (EM) 9*64kB (UEM) 4*128kB (UM) 2*256kB (UM) 2*512kB (UM) 0*1024kB 2*2048kB (UM) 0*4096kB = 7968kB
Feb 12 19:14:36 dev kernel: Node 0 DMA32: 1411*4kB (UM) 6*8kB (M) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5708kB
Feb 12 19:14:36 dev kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 12 19:14:36 dev kernel: 539 total pagecache pages
Feb 12 19:14:36 dev kernel: 0 pages in swap cache
Feb 12 19:14:36 dev kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 12 19:14:36 dev kernel: Free swap  = 0kB
Feb 12 19:14:36 dev kernel: Total swap = 0kB
Feb 12 19:14:36 dev kernel: 524158 pages RAM
Feb 12 19:14:36 dev kernel: 0 pages HighMem/MovableOnly
Feb 12 19:14:36 dev kernel: 11910 pages reserved
Feb 12 19:14:36 dev kernel: 0 pages hwpoisoned
Feb 12 19:14:36 dev kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Feb 12 19:14:36 dev kernel: [  374]     0   374    13724       95      31       4        0             0 systemd-journal
Feb 12 19:14:36 dev kernel: [  444]     0   444     6084       78      16       3        0             0 systemd-logind
Feb 12 19:14:36 dev kernel: [  473]    81   473    32144      165      33       3        0          -900 dbus-daemon
Feb 12 19:14:36 dev kernel: [  530]     0   530     2169       32      10       3        0             0 agetty
Feb 12 19:14:36 dev kernel: [  533]     0   533     2169       29       9       3        0             0 agetty
Feb 12 19:14:36 dev kernel: [  708]     0   708    23354     3084      45       3        0             0 dhclient
Feb 12 19:14:36 dev kernel: [ 2320]     0  2320    25094      104      38       3        0             0 gssproxy
Feb 12 19:14:36 dev kernel: [ 2445]    32  2445    16464      145      35       3        0             0 rpcbind
Feb 12 19:14:36 dev kernel: [ 2447]     0  2447     4291       45      14       3        0             0 rpc.idmapd
Feb 12 19:14:36 dev kernel: [ 2450]    29  2450    10605      208      26       3        0             0 rpc.statd
Feb 12 19:14:36 dev kernel: [ 2451]     0  2451    10652      183      24       3        0             0 rpc.mountd
Feb 12 19:14:36 dev kernel: [ 3468]     0  3468     5704      151      17       3        0             0 crond
Feb 12 19:14:36 dev kernel: [ 3644]     0  3644    11250      364      24       3        0         -1000 systemd-udevd
Feb 12 19:14:36 dev kernel: [ 4465]     0  4465    12268       92      25       3        0         -1000 auditd
Feb 12 19:14:36 dev kernel: [16458]   994 16458   124840     1169      76       3        0             0 qpidd
Feb 12 19:14:36 dev kernel: [18225]    23 18225    21336      894      39       3        0             0 squid
Feb 12 19:14:36 dev kernel: [18227]    23 18227    24838     2670      47       3        0             0 squid
Feb 12 19:14:36 dev kernel: [18229]    23 18229     6992       52      19       3        0             0 log_file_daemon
Feb 12 19:14:36 dev kernel: [18356]   184 18356   616450    14553     555       5        0             0 mongod
Feb 12 19:14:36 dev kernel: [18488]     0 18488    20354      207      41       3        0         -1000 sshd
Feb 12 19:14:36 dev kernel: [20551]     0 20551   313477     4014     103       4        0             0 python
Feb 12 19:14:36 dev kernel: [20552]    48 20552   127793    14594     183       4        0             0 pulp_streamer
Feb 12 19:14:36 dev kernel: [20553]    48 20553   144600    14097     181       3        0             0 celery
Feb 12 19:14:36 dev kernel: [20554]    48 20554   144116     8929     114       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20556]     0 20556    68831      798     137       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20561]    48 20561   144601    14084     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20563]    48 20563   144601    14088     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20565]    48 20565   144600    14076     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20571]    48 20571   144600    14082     180       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20577]    48 20577   292827    14987     294       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20578]    48 20578   292828    14691     293       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20580]    48 20580   292956    14886     293       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20581]    48 20581   140971     5912     194       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20582]    48 20582   140971     5912     194       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20583]    48 20583   157355     5912     195       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20584]    48 20584   221349    13428     282       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20585]    48 20585   237734    13428     283       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20586]    48 20586   221349    13379     282       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20587]    48 20587    72028     1364     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20588]    48 20588    72051     1395     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20590]    48 20590    72026     1363     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20593]    48 20593    72028     1364     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20595]    48 20595    72026     1363     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20807]    48 20807    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20853]    48 20853   161770    13916     174       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20854]    48 20854   145386    13916     173       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20856]    48 20856   162967    14063     176       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20857]    48 20857   403431   228560     605       5        0             0 celery
Feb 12 19:14:36 dev kernel: [20858]    48 20858   144541    14014     171       3        0             0 celery
Feb 12 19:14:36 dev kernel: [ 3128]     0  3128    37869      317      77       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3131]  1001  3131    11238      142      27       3        0             0 systemd
Feb 12 19:14:36 dev kernel: [ 3133]  1001  3133    43881      688      49       3        0             0 (sd-pam)
Feb 12 19:14:36 dev kernel: [ 3136]  1001  3136    37869      315      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3137]  1001  3137     4333      578      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3443]     0  3443    37869      315      76       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3447]  1001  3447    37869      367      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3448]  1001  3448     4333      547      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3500]     0  3500    30367      216      62       3        0             0 sudo
Feb 12 19:14:36 dev kernel: [ 3501]     0  3501    13731       59      27       3        0             0 journalctl
Feb 12 19:14:36 dev kernel: [ 3565]     0  3565    37869      315      77       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3569]  1001  3569    37869      314      74       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3570]  1001  3570     4333      546      13       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3620]  1001  3620    13060     1084      31       3        0             0 dstat
Feb 12 19:14:36 dev kernel: [ 3717]     0  3717    37869      316      78       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3721]  1001  3721    37869      315      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3722]  1001  3722     4333      555      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3775]  1001  3775    14124      166      32       3        0             0 top
Feb 12 19:14:36 dev kernel: [ 4001]    48  4001    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [ 4088]    48  4088    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [ 5344]  1001  5344    48752     5025      97       3        0             0 pulp-admin
Feb 12 19:14:36 dev kernel: Out of memory: Kill process 20857 (celery) score 447 or sacrifice child
Feb 12 19:14:36 dev kernel: Killed process 20857 (celery) total-vm:1613724kB, anon-rss:914236kB, file-rss:4kB
Feb 12 19:14:36 dev kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 12 19:14:36 dev kernel: httpd cpuset=/ mems_allowed=0
Feb 12 19:14:36 dev kernel: CPU: 2 PID: 20771 Comm: httpd Not tainted 4.2.3-300.fc23.x86_64 #1
Feb 12 19:14:36 dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
Feb 12 19:14:36 dev kernel:  0000000000000000 00000000c5cb6a30 ffff88000d8c7948 ffffffff81771fca
Feb 12 19:14:36 dev kernel:  0000000000000000 ffff88000f3e9c40 ffff88000d8c79d8 ffffffff8177150e
Feb 12 19:14:36 dev kernel:  65a4d4fb810b5ae5 fffeefff00000000 0000000000000002 000000000209e303
Feb 12 19:14:36 dev kernel: Call Trace:
Feb 12 19:14:36 dev kernel:  [<ffffffff81771fca>] dump_stack+0x45/0x57
Feb 12 19:14:36 dev kernel:  [<ffffffff8177150e>] dump_header+0x86/0x207
Feb 12 19:14:36 dev kernel:  [<ffffffff811a627b>] oom_kill_process+0x1db/0x3a0
Feb 12 19:14:36 dev kernel:  [<ffffffff811a69ec>] out_of_memory+0x54c/0x5a0
Feb 12 19:14:36 dev kernel:  [<ffffffff811ac9d8>] __alloc_pages_nodemask+0x838/0x980
Feb 12 19:14:36 dev kernel:  [<ffffffff811f4f71>] alloc_pages_current+0x91/0x100
Feb 12 19:14:36 dev kernel:  [<ffffffff811a2a0b>] __page_cache_alloc+0xab/0xc0
Feb 12 19:14:36 dev kernel:  [<ffffffff811a4c34>] filemap_fault+0x154/0x410
Feb 12 19:14:36 dev kernel:  [<ffffffff811d10de>] __do_fault+0x4e/0xf0
Feb 12 19:14:36 dev kernel:  [<ffffffff811d6218>] handle_mm_fault+0xf58/0x17d0
Feb 12 19:14:36 dev kernel:  [<ffffffff8111b8a2>] ? do_futex+0x112/0xb00
Feb 12 19:14:36 dev kernel:  [<ffffffff81065447>] __do_page_fault+0x197/0x400
Feb 12 19:14:36 dev kernel:  [<ffffffff81065777>] trace_do_page_fault+0x47/0x100
Feb 12 19:14:36 dev kernel:  [<ffffffff8105fd2e>] do_async_page_fault+0x1e/0xc0
Feb 12 19:14:36 dev kernel:  [<ffffffff8177a968>] async_page_fault+0x28/0x30
Feb 12 19:14:36 dev kernel: Mem-Info:
Feb 12 19:14:36 dev kernel: active_anon:477500 inactive_anon:408 isolated_anon:0
                             active_file:31 inactive_file:0 isolated_file:0
                             unevictable:0 dirty:0 writeback:0 unstable:0
                             slab_reclaimable:6690 slab_unreclaimable:9554
                             mapped:291 shmem:479 pagetables:8316 bounce:0
                             free:3361 free_pcp:64 free_cma:0
Feb 12 19:14:36 dev kernel: Node 0 DMA free:7968kB min:40kB low:48kB high:60kB active_anon:6540kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB ml
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 1982 1982 1982
Feb 12 19:14:36 dev kernel: Node 0 DMA32 free:82140kB min:5580kB low:6972kB high:8368kB active_anon:1825712kB inactive_anon:1624kB active_file:124kB inactive_file:332kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 0 0 0
Feb 12 19:14:36 dev kernel: Node 0 DMA: 63*4kB (UEM) 15*8kB (UEM) 29*16kB (E) 21*32kB (EM) 9*64kB (UEM) 4*128kB (UM) 2*256kB (UM) 2*512kB (UM) 0*1024kB 2*2048kB (UM) 0*4096kB = 8228kB
Feb 12 19:14:36 dev kernel: Node 0 DMA32: 15988*4kB (UEM) 4728*8kB (UEM) 992*16kB (UEM) 144*32kB (UEM) 17*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 123472kB
Feb 12 19:14:36 dev kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 12 19:14:36 dev kernel: 945 total pagecache pages
Feb 12 19:14:36 dev kernel: 0 pages in swap cache
Feb 12 19:14:36 dev kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 12 19:14:36 dev kernel: Free swap  = 0kB
Feb 12 19:14:36 dev kernel: Total swap = 0kB
Feb 12 19:14:36 dev kernel: 524158 pages RAM
Feb 12 19:14:36 dev kernel: 0 pages HighMem/MovableOnly
Feb 12 19:14:36 dev kernel: 11910 pages reserved
Feb 12 19:14:36 dev kernel: 0 pages hwpoisoned
Feb 12 19:14:36 dev kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Feb 12 19:14:36 dev kernel: [  374]     0   374    13724       95      31       4        0             0 systemd-journal
Feb 12 19:14:36 dev kernel: [  444]     0   444     6084       78      16       3        0             0 systemd-logind
Feb 12 19:14:36 dev kernel: [  473]    81   473    32144      165      33       3        0          -900 dbus-daemon
Feb 12 19:14:36 dev kernel: [  530]     0   530     2169       32      10       3        0             0 agetty
Feb 12 19:14:36 dev kernel: [  533]     0   533     2169       29       9       3        0             0 agetty
Feb 12 19:14:36 dev kernel: [  708]     0   708    23354     3084      45       3        0             0 dhclient
Feb 12 19:14:36 dev kernel: [ 2320]     0  2320    25094      104      38       3        0             0 gssproxy
Feb 12 19:14:36 dev kernel: [ 2445]    32  2445    16464      145      35       3        0             0 rpcbind
Feb 12 19:14:36 dev kernel: [ 2447]     0  2447     4291       45      14       3        0             0 rpc.idmapd
Feb 12 19:14:36 dev kernel: [ 2450]    29  2450    10605      208      26       3        0             0 rpc.statd
Feb 12 19:14:36 dev kernel: [ 2451]     0  2451    10652      183      24       3        0             0 rpc.mountd
Feb 12 19:14:36 dev kernel: [ 3468]     0  3468     5704      151      17       3        0             0 crond
Feb 12 19:14:36 dev kernel: [ 3644]     0  3644    11250      364      24       3        0         -1000 systemd-udevd
Feb 12 19:14:36 dev kernel: [ 4465]     0  4465    12268       92      25       3        0         -1000 auditd
Feb 12 19:14:36 dev kernel: [16458]   994 16458   124840     1169      76       3        0             0 qpidd
Feb 12 19:14:36 dev kernel: [18225]    23 18225    21336      894      39       3        0             0 squid
Feb 12 19:14:36 dev kernel: [18227]    23 18227    24838     2670      47       3        0             0 squid
Feb 12 19:14:36 dev kernel: [18229]    23 18229     6992       52      19       3        0             0 log_file_daemon
Feb 12 19:14:36 dev kernel: [18356]   184 18356   616450    14553     555       5        0             0 mongod
Feb 12 19:14:36 dev kernel: [18488]     0 18488    20354      207      41       3        0         -1000 sshd
Feb 12 19:14:36 dev kernel: [20551]     0 20551   313477     4014     103       4        0             0 python
Feb 12 19:14:36 dev kernel: [20552]    48 20552   127793    14593     183       4        0             0 pulp_streamer
Feb 12 19:14:36 dev kernel: [20553]    48 20553   144600    14097     181       3        0             0 celery
Feb 12 19:14:36 dev kernel: [20554]    48 20554   144116     8929     114       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20556]     0 20556    68831      798     137       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20561]    48 20561   144601    14084     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20563]    48 20563   144601    14088     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20565]    48 20565   144600    14076     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20571]    48 20571   144600    14082     180       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20577]    48 20577   292827    14986     294       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20578]    48 20578   292828    14690     293       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20580]    48 20580   292956    15204     293       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20581]    48 20581   140971     5911     194       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20582]    48 20582   140971     5912     194       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20583]    48 20583   157355     5911     195       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20584]    48 20584   221349    13427     282       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20585]    48 20585   237734    13427     283       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20586]    48 20586   221349    13378     282       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20587]    48 20587    72028     1364     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20588]    48 20588    72051     1395     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20590]    48 20590    72026     1363     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20593]    48 20593    72028     1364     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20595]    48 20595    72026     1363     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20807]    48 20807    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20853]    48 20853   161770    13916     174       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20854]    48 20854   145386    13916     173       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20856]    48 20856   162967    14063     176       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20858]    48 20858   144541    14014     171       3        0             0 celery
Feb 12 19:14:36 dev kernel: [ 3128]     0  3128    37869      317      77       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3131]  1001  3131    11238      142      27       3        0             0 systemd
Feb 12 19:14:36 dev kernel: [ 3133]  1001  3133    43881      688      49       3        0             0 (sd-pam)
Feb 12 19:14:36 dev kernel: [ 3136]  1001  3136    37869      315      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3137]  1001  3137     4333      578      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3443]     0  3443    37869      315      76       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3447]  1001  3447    37869      367      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3448]  1001  3448     4333      547      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3500]     0  3500    30367      216      62       3        0             0 sudo
Feb 12 19:14:36 dev kernel: [ 3501]     0  3501    13731       59      27       3        0             0 journalctl
Feb 12 19:14:36 dev kernel: [ 3565]     0  3565    37869      315      77       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3569]  1001  3569    37869      314      74       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3570]  1001  3570     4333      546      13       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3620]  1001  3620    13060     1084      31       3        0             0 dstat
Feb 12 19:14:36 dev kernel: [ 3717]     0  3717    37869      316      78       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3721]  1001  3721    37869      315      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3722]  1001  3722     4333      555      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3775]  1001  3775    14124      166      32       3        0             0 top
Feb 12 19:14:36 dev kernel: [ 4001]    48  4001    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [ 4088]    48  4088    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [ 5344]  1001  5344    48752     5025      97       3        0             0 pulp-admin
Feb 12 19:14:36 dev kernel: Out of memory: Kill process 2099 (celery) score 447 or sacrifice child
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Task pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd] raised unexpected: WorkerLostError('Worker exited prematurel
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Traceback (most recent call last):
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776)   File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776)     human_status(exitcode)),
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forkin
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)   "MongoClient opened before fork. Create MongoClient "
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)
Feb 12 19:14:36 dev celery[20561]: reserved_resource_worker-0@dev ready.
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5] succeeded in 0.351034021005s: None
Feb 12 19:31:16 dev sudo[6302]:  vagrant : TTY=pts/0 ; PWD=/home/vagrant ; USER=root ; COMMAND=/bin/systemctl status pulp_worker-0

The celery parent process is still running after the OOM:

$ sudo systemctl status pulp_worker-0 
● pulp_worker-0.service - Pulp Worker #0
   Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2016-02-11 22:43:22 UTC; 20h ago
 Main PID: 20561 (celery)
   CGroup: /system.slice/pulp_worker-0.service
           ├─ 5369 /bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
           └─20561 /bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30

Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Task pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd] raised unexpected: WorkerLostError('Wor...al 9 (SIGKILL).',)
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Traceback (most recent call last):
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776)   File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776)     human_status(exitcode)),
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after for...
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)   "MongoClient opened before fork. Create MongoClient "
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)
Feb 12 19:14:36 dev celery[20561]: reserved_resource_worker-0@dev ready.
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5] succeeded in 0.351034021005s: None
Hint: Some lines were ellipsized, use -l to show in full.

And here you can see that Pulp still thinks the task is running:

$ pulp-admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+
Operations:  associate
Resources:   ohno (repository), rhel7 (repository)
State:       Running
Start Time:  2016-02-12T19:14:10Z
Finish Time: Incomplete
Task Id:     6785cda1-4237-4ae7-aad2-313e8d3b77dd

My theory is that Celery's parent process noticed that its child died and automatically started a new one to replace it for us. Since our worker watcher just makes sure that worker 0 is hearbeating on an interval, it is none the wiser that the process has been replaced. Unfortunately, nothing seems to notice that the task was lost and thus it will remain in the running state forever, despite being lost. Unfortunately, I don't have a proposal for what we could do differently as I need to fix my original issue at the moment.

As for triage, this is probably an issue we've had for all releases 2.4+, so I don't think it's a 2.8.0 blocker.


Related issues

Related to Pulp - Issue #2045: Task stuck at waiting if child process segfaultsCLOSED - DUPLICATEActions

Also available in: Atom PDF