Project

Profile

Help

Issue #1673

Pulp's worker watcher does not notice workers that got killed by OOM killer and their tasks stay "running" forever

Added by rbarlow over 5 years ago. Updated about 2 years ago.

Status:
CLOSED - CURRENTRELEASE
Priority:
High
Assignee:
Category:
-
Sprint/Milestone:
-
Start date:
Due date:
Estimated time:
Severity:
3. High
Version:
Platform Release:
2.9.1
OS:
Triaged:
Yes
Groomed:
No
Sprint Candidate:
No
Tags:
Pulp 2
Sprint:
Sprint 5
Quarter:

Description

In investing #1662 I had a worker get killed by the OOM killer. After this happened, the worker watcher did not become aware and so did not cancel the task that was now lost. Thus, the task stayed in a running state forever. Here is the log output of the worker getting killed by OOM killer:

Feb 12 19:14:10 dev pulp[20553]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[20f6a5df-babd-4d2c-a0b1-2df849dd9414]
Feb 12 19:14:10 dev pulp[20561]: celery.worker.strategy:INFO: Received task: pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd]
Feb 12 19:14:10 dev pulp[20561]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5]
Feb 12 19:14:10 dev pulp[20553]: celery.worker.job:INFO: Task pulp.server.async.tasks._queue_reserved_task[20f6a5df-babd-4d2c-a0b1-2df849dd9414] succeeded in 0.0379113480012s: None
Feb 12 19:14:36 dev kernel: celery invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 12 19:14:36 dev kernel: celery cpuset=/ mems_allowed=0
Feb 12 19:14:36 dev kernel: CPU: 0 PID: 20751 Comm: celery Not tainted 4.2.3-300.fc23.x86_64 #1
Feb 12 19:14:36 dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
Feb 12 19:14:36 dev kernel:  0000000000000000 000000004e0abf34 ffff88001a627948 ffffffff81771fca
Feb 12 19:14:36 dev kernel:  0000000000000000 ffff880075ba9c40 ffff88001a6279d8 ffffffff8177150e
Feb 12 19:14:36 dev kernel:  65a4d4fb1a6279a8 fffeefff00000000 0000000000000002 000000000209e203
Feb 12 19:14:36 dev kernel: Call Trace:
Feb 12 19:14:36 dev kernel:  [<ffffffff81771fca>] dump_stack+0x45/0x57
Feb 12 19:14:36 dev kernel:  [<ffffffff8177150e>] dump_header+0x86/0x207
Feb 12 19:14:36 dev kernel:  [<ffffffff811a627b>] oom_kill_process+0x1db/0x3a0
Feb 12 19:14:36 dev kernel:  [<ffffffff811a69ec>] out_of_memory+0x54c/0x5a0
Feb 12 19:14:36 dev kernel:  [<ffffffff811ac9d8>] __alloc_pages_nodemask+0x838/0x980
Feb 12 19:14:36 dev kernel:  [<ffffffff811f4f71>] alloc_pages_current+0x91/0x100
Feb 12 19:14:36 dev kernel:  [<ffffffff811a2a0b>] __page_cache_alloc+0xab/0xc0
Feb 12 19:14:36 dev kernel:  [<ffffffff811a4c34>] filemap_fault+0x154/0x410
Feb 12 19:14:36 dev kernel:  [<ffffffff811d10de>] __do_fault+0x4e/0xf0
Feb 12 19:14:36 dev kernel:  [<ffffffff811d6218>] handle_mm_fault+0xf58/0x17d0
Feb 12 19:14:36 dev kernel:  [<ffffffff81643efb>] ? sock_recvmsg+0x3b/0x50
Feb 12 19:14:36 dev kernel:  [<ffffffff81065447>] __do_page_fault+0x197/0x400
Feb 12 19:14:36 dev kernel:  [<ffffffff81065777>] trace_do_page_fault+0x47/0x100
Feb 12 19:14:36 dev kernel:  [<ffffffff8105fd2e>] do_async_page_fault+0x1e/0xc0
Feb 12 19:14:36 dev kernel:  [<ffffffff8177a968>] async_page_fault+0x28/0x30
Feb 12 19:14:36 dev kernel: Mem-Info:
Feb 12 19:14:36 dev kernel: active_anon:477508 inactive_anon:408 isolated_anon:0
                             active_file:34 inactive_file:0 isolated_file:17
                             unevictable:0 dirty:0 writeback:0 unstable:0
                             slab_reclaimable:6716 slab_unreclaimable:9582
                             mapped:353 shmem:479 pagetables:8316 bounce:0
                             free:3374 free_pcp:35 free_cma:0
Feb 12 19:14:36 dev kernel: Node 0 DMA free:7968kB min:40kB low:48kB high:60kB active_anon:6540kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB ml
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 1982 1982 1982
Feb 12 19:14:36 dev kernel: Node 0 DMA32 free:5528kB min:5580kB low:6972kB high:8368kB active_anon:1903492kB inactive_anon:1624kB active_file:136kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 0 0 0
Feb 12 19:14:36 dev kernel: Node 0 DMA: 14*4kB (UEM) 7*8kB (UEM) 29*16kB (E) 21*32kB (EM) 9*64kB (UEM) 4*128kB (UM) 2*256kB (UM) 2*512kB (UM) 0*1024kB 2*2048kB (UM) 0*4096kB = 7968kB
Feb 12 19:14:36 dev kernel: Node 0 DMA32: 1411*4kB (UM) 6*8kB (M) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5708kB
Feb 12 19:14:36 dev kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 12 19:14:36 dev kernel: 539 total pagecache pages
Feb 12 19:14:36 dev kernel: 0 pages in swap cache
Feb 12 19:14:36 dev kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 12 19:14:36 dev kernel: Free swap  = 0kB
Feb 12 19:14:36 dev kernel: Total swap = 0kB
Feb 12 19:14:36 dev kernel: 524158 pages RAM
Feb 12 19:14:36 dev kernel: 0 pages HighMem/MovableOnly
Feb 12 19:14:36 dev kernel: 11910 pages reserved
Feb 12 19:14:36 dev kernel: 0 pages hwpoisoned
Feb 12 19:14:36 dev kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Feb 12 19:14:36 dev kernel: [  374]     0   374    13724       95      31       4        0             0 systemd-journal
Feb 12 19:14:36 dev kernel: [  444]     0   444     6084       78      16       3        0             0 systemd-logind
Feb 12 19:14:36 dev kernel: [  473]    81   473    32144      165      33       3        0          -900 dbus-daemon
Feb 12 19:14:36 dev kernel: [  530]     0   530     2169       32      10       3        0             0 agetty
Feb 12 19:14:36 dev kernel: [  533]     0   533     2169       29       9       3        0             0 agetty
Feb 12 19:14:36 dev kernel: [  708]     0   708    23354     3084      45       3        0             0 dhclient
Feb 12 19:14:36 dev kernel: [ 2320]     0  2320    25094      104      38       3        0             0 gssproxy
Feb 12 19:14:36 dev kernel: [ 2445]    32  2445    16464      145      35       3        0             0 rpcbind
Feb 12 19:14:36 dev kernel: [ 2447]     0  2447     4291       45      14       3        0             0 rpc.idmapd
Feb 12 19:14:36 dev kernel: [ 2450]    29  2450    10605      208      26       3        0             0 rpc.statd
Feb 12 19:14:36 dev kernel: [ 2451]     0  2451    10652      183      24       3        0             0 rpc.mountd
Feb 12 19:14:36 dev kernel: [ 3468]     0  3468     5704      151      17       3        0             0 crond
Feb 12 19:14:36 dev kernel: [ 3644]     0  3644    11250      364      24       3        0         -1000 systemd-udevd
Feb 12 19:14:36 dev kernel: [ 4465]     0  4465    12268       92      25       3        0         -1000 auditd
Feb 12 19:14:36 dev kernel: [16458]   994 16458   124840     1169      76       3        0             0 qpidd
Feb 12 19:14:36 dev kernel: [18225]    23 18225    21336      894      39       3        0             0 squid
Feb 12 19:14:36 dev kernel: [18227]    23 18227    24838     2670      47       3        0             0 squid
Feb 12 19:14:36 dev kernel: [18229]    23 18229     6992       52      19       3        0             0 log_file_daemon
Feb 12 19:14:36 dev kernel: [18356]   184 18356   616450    14553     555       5        0             0 mongod
Feb 12 19:14:36 dev kernel: [18488]     0 18488    20354      207      41       3        0         -1000 sshd
Feb 12 19:14:36 dev kernel: [20551]     0 20551   313477     4014     103       4        0             0 python
Feb 12 19:14:36 dev kernel: [20552]    48 20552   127793    14594     183       4        0             0 pulp_streamer
Feb 12 19:14:36 dev kernel: [20553]    48 20553   144600    14097     181       3        0             0 celery
Feb 12 19:14:36 dev kernel: [20554]    48 20554   144116     8929     114       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20556]     0 20556    68831      798     137       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20561]    48 20561   144601    14084     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20563]    48 20563   144601    14088     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20565]    48 20565   144600    14076     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20571]    48 20571   144600    14082     180       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20577]    48 20577   292827    14987     294       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20578]    48 20578   292828    14691     293       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20580]    48 20580   292956    14886     293       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20581]    48 20581   140971     5912     194       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20582]    48 20582   140971     5912     194       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20583]    48 20583   157355     5912     195       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20584]    48 20584   221349    13428     282       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20585]    48 20585   237734    13428     283       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20586]    48 20586   221349    13379     282       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20587]    48 20587    72028     1364     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20588]    48 20588    72051     1395     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20590]    48 20590    72026     1363     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20593]    48 20593    72028     1364     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20595]    48 20595    72026     1363     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20807]    48 20807    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20853]    48 20853   161770    13916     174       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20854]    48 20854   145386    13916     173       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20856]    48 20856   162967    14063     176       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20857]    48 20857   403431   228560     605       5        0             0 celery
Feb 12 19:14:36 dev kernel: [20858]    48 20858   144541    14014     171       3        0             0 celery
Feb 12 19:14:36 dev kernel: [ 3128]     0  3128    37869      317      77       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3131]  1001  3131    11238      142      27       3        0             0 systemd
Feb 12 19:14:36 dev kernel: [ 3133]  1001  3133    43881      688      49       3        0             0 (sd-pam)
Feb 12 19:14:36 dev kernel: [ 3136]  1001  3136    37869      315      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3137]  1001  3137     4333      578      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3443]     0  3443    37869      315      76       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3447]  1001  3447    37869      367      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3448]  1001  3448     4333      547      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3500]     0  3500    30367      216      62       3        0             0 sudo
Feb 12 19:14:36 dev kernel: [ 3501]     0  3501    13731       59      27       3        0             0 journalctl
Feb 12 19:14:36 dev kernel: [ 3565]     0  3565    37869      315      77       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3569]  1001  3569    37869      314      74       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3570]  1001  3570     4333      546      13       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3620]  1001  3620    13060     1084      31       3        0             0 dstat
Feb 12 19:14:36 dev kernel: [ 3717]     0  3717    37869      316      78       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3721]  1001  3721    37869      315      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3722]  1001  3722     4333      555      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3775]  1001  3775    14124      166      32       3        0             0 top
Feb 12 19:14:36 dev kernel: [ 4001]    48  4001    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [ 4088]    48  4088    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [ 5344]  1001  5344    48752     5025      97       3        0             0 pulp-admin
Feb 12 19:14:36 dev kernel: Out of memory: Kill process 20857 (celery) score 447 or sacrifice child
Feb 12 19:14:36 dev kernel: Killed process 20857 (celery) total-vm:1613724kB, anon-rss:914236kB, file-rss:4kB
Feb 12 19:14:36 dev kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 12 19:14:36 dev kernel: httpd cpuset=/ mems_allowed=0
Feb 12 19:14:36 dev kernel: CPU: 2 PID: 20771 Comm: httpd Not tainted 4.2.3-300.fc23.x86_64 #1
Feb 12 19:14:36 dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
Feb 12 19:14:36 dev kernel:  0000000000000000 00000000c5cb6a30 ffff88000d8c7948 ffffffff81771fca
Feb 12 19:14:36 dev kernel:  0000000000000000 ffff88000f3e9c40 ffff88000d8c79d8 ffffffff8177150e
Feb 12 19:14:36 dev kernel:  65a4d4fb810b5ae5 fffeefff00000000 0000000000000002 000000000209e303
Feb 12 19:14:36 dev kernel: Call Trace:
Feb 12 19:14:36 dev kernel:  [<ffffffff81771fca>] dump_stack+0x45/0x57
Feb 12 19:14:36 dev kernel:  [<ffffffff8177150e>] dump_header+0x86/0x207
Feb 12 19:14:36 dev kernel:  [<ffffffff811a627b>] oom_kill_process+0x1db/0x3a0
Feb 12 19:14:36 dev kernel:  [<ffffffff811a69ec>] out_of_memory+0x54c/0x5a0
Feb 12 19:14:36 dev kernel:  [<ffffffff811ac9d8>] __alloc_pages_nodemask+0x838/0x980
Feb 12 19:14:36 dev kernel:  [<ffffffff811f4f71>] alloc_pages_current+0x91/0x100
Feb 12 19:14:36 dev kernel:  [<ffffffff811a2a0b>] __page_cache_alloc+0xab/0xc0
Feb 12 19:14:36 dev kernel:  [<ffffffff811a4c34>] filemap_fault+0x154/0x410
Feb 12 19:14:36 dev kernel:  [<ffffffff811d10de>] __do_fault+0x4e/0xf0
Feb 12 19:14:36 dev kernel:  [<ffffffff811d6218>] handle_mm_fault+0xf58/0x17d0
Feb 12 19:14:36 dev kernel:  [<ffffffff8111b8a2>] ? do_futex+0x112/0xb00
Feb 12 19:14:36 dev kernel:  [<ffffffff81065447>] __do_page_fault+0x197/0x400
Feb 12 19:14:36 dev kernel:  [<ffffffff81065777>] trace_do_page_fault+0x47/0x100
Feb 12 19:14:36 dev kernel:  [<ffffffff8105fd2e>] do_async_page_fault+0x1e/0xc0
Feb 12 19:14:36 dev kernel:  [<ffffffff8177a968>] async_page_fault+0x28/0x30
Feb 12 19:14:36 dev kernel: Mem-Info:
Feb 12 19:14:36 dev kernel: active_anon:477500 inactive_anon:408 isolated_anon:0
                             active_file:31 inactive_file:0 isolated_file:0
                             unevictable:0 dirty:0 writeback:0 unstable:0
                             slab_reclaimable:6690 slab_unreclaimable:9554
                             mapped:291 shmem:479 pagetables:8316 bounce:0
                             free:3361 free_pcp:64 free_cma:0
Feb 12 19:14:36 dev kernel: Node 0 DMA free:7968kB min:40kB low:48kB high:60kB active_anon:6540kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB ml
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 1982 1982 1982
Feb 12 19:14:36 dev kernel: Node 0 DMA32 free:82140kB min:5580kB low:6972kB high:8368kB active_anon:1825712kB inactive_anon:1624kB active_file:124kB inactive_file:332kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 0 0 0
Feb 12 19:14:36 dev kernel: Node 0 DMA: 63*4kB (UEM) 15*8kB (UEM) 29*16kB (E) 21*32kB (EM) 9*64kB (UEM) 4*128kB (UM) 2*256kB (UM) 2*512kB (UM) 0*1024kB 2*2048kB (UM) 0*4096kB = 8228kB
Feb 12 19:14:36 dev kernel: Node 0 DMA32: 15988*4kB (UEM) 4728*8kB (UEM) 992*16kB (UEM) 144*32kB (UEM) 17*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 123472kB
Feb 12 19:14:36 dev kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 12 19:14:36 dev kernel: 945 total pagecache pages
Feb 12 19:14:36 dev kernel: 0 pages in swap cache
Feb 12 19:14:36 dev kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 12 19:14:36 dev kernel: Free swap  = 0kB
Feb 12 19:14:36 dev kernel: Total swap = 0kB
Feb 12 19:14:36 dev kernel: 524158 pages RAM
Feb 12 19:14:36 dev kernel: 0 pages HighMem/MovableOnly
Feb 12 19:14:36 dev kernel: 11910 pages reserved
Feb 12 19:14:36 dev kernel: 0 pages hwpoisoned
Feb 12 19:14:36 dev kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Feb 12 19:14:36 dev kernel: [  374]     0   374    13724       95      31       4        0             0 systemd-journal
Feb 12 19:14:36 dev kernel: [  444]     0   444     6084       78      16       3        0             0 systemd-logind
Feb 12 19:14:36 dev kernel: [  473]    81   473    32144      165      33       3        0          -900 dbus-daemon
Feb 12 19:14:36 dev kernel: [  530]     0   530     2169       32      10       3        0             0 agetty
Feb 12 19:14:36 dev kernel: [  533]     0   533     2169       29       9       3        0             0 agetty
Feb 12 19:14:36 dev kernel: [  708]     0   708    23354     3084      45       3        0             0 dhclient
Feb 12 19:14:36 dev kernel: [ 2320]     0  2320    25094      104      38       3        0             0 gssproxy
Feb 12 19:14:36 dev kernel: [ 2445]    32  2445    16464      145      35       3        0             0 rpcbind
Feb 12 19:14:36 dev kernel: [ 2447]     0  2447     4291       45      14       3        0             0 rpc.idmapd
Feb 12 19:14:36 dev kernel: [ 2450]    29  2450    10605      208      26       3        0             0 rpc.statd
Feb 12 19:14:36 dev kernel: [ 2451]     0  2451    10652      183      24       3        0             0 rpc.mountd
Feb 12 19:14:36 dev kernel: [ 3468]     0  3468     5704      151      17       3        0             0 crond
Feb 12 19:14:36 dev kernel: [ 3644]     0  3644    11250      364      24       3        0         -1000 systemd-udevd
Feb 12 19:14:36 dev kernel: [ 4465]     0  4465    12268       92      25       3        0         -1000 auditd
Feb 12 19:14:36 dev kernel: [16458]   994 16458   124840     1169      76       3        0             0 qpidd
Feb 12 19:14:36 dev kernel: [18225]    23 18225    21336      894      39       3        0             0 squid
Feb 12 19:14:36 dev kernel: [18227]    23 18227    24838     2670      47       3        0             0 squid
Feb 12 19:14:36 dev kernel: [18229]    23 18229     6992       52      19       3        0             0 log_file_daemon
Feb 12 19:14:36 dev kernel: [18356]   184 18356   616450    14553     555       5        0             0 mongod
Feb 12 19:14:36 dev kernel: [18488]     0 18488    20354      207      41       3        0         -1000 sshd
Feb 12 19:14:36 dev kernel: [20551]     0 20551   313477     4014     103       4        0             0 python
Feb 12 19:14:36 dev kernel: [20552]    48 20552   127793    14593     183       4        0             0 pulp_streamer
Feb 12 19:14:36 dev kernel: [20553]    48 20553   144600    14097     181       3        0             0 celery
Feb 12 19:14:36 dev kernel: [20554]    48 20554   144116     8929     114       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20556]     0 20556    68831      798     137       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20561]    48 20561   144601    14084     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20563]    48 20563   144601    14088     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20565]    48 20565   144600    14076     183       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20571]    48 20571   144600    14082     180       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20577]    48 20577   292827    14986     294       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20578]    48 20578   292828    14690     293       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20580]    48 20580   292956    15204     293       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20581]    48 20581   140971     5911     194       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20582]    48 20582   140971     5912     194       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20583]    48 20583   157355     5911     195       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20584]    48 20584   221349    13427     282       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20585]    48 20585   237734    13427     283       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20586]    48 20586   221349    13378     282       4        0             0 httpd
Feb 12 19:14:36 dev kernel: [20587]    48 20587    72028     1364     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20588]    48 20588    72051     1395     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20590]    48 20590    72026     1363     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20593]    48 20593    72028     1364     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20595]    48 20595    72026     1363     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20807]    48 20807    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [20853]    48 20853   161770    13916     174       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20854]    48 20854   145386    13916     173       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20856]    48 20856   162967    14063     176       4        0             0 celery
Feb 12 19:14:36 dev kernel: [20858]    48 20858   144541    14014     171       3        0             0 celery
Feb 12 19:14:36 dev kernel: [ 3128]     0  3128    37869      317      77       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3131]  1001  3131    11238      142      27       3        0             0 systemd
Feb 12 19:14:36 dev kernel: [ 3133]  1001  3133    43881      688      49       3        0             0 (sd-pam)
Feb 12 19:14:36 dev kernel: [ 3136]  1001  3136    37869      315      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3137]  1001  3137     4333      578      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3443]     0  3443    37869      315      76       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3447]  1001  3447    37869      367      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3448]  1001  3448     4333      547      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3500]     0  3500    30367      216      62       3        0             0 sudo
Feb 12 19:14:36 dev kernel: [ 3501]     0  3501    13731       59      27       3        0             0 journalctl
Feb 12 19:14:36 dev kernel: [ 3565]     0  3565    37869      315      77       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3569]  1001  3569    37869      314      74       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3570]  1001  3570     4333      546      13       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3620]  1001  3620    13060     1084      31       3        0             0 dstat
Feb 12 19:14:36 dev kernel: [ 3717]     0  3717    37869      316      78       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3721]  1001  3721    37869      315      75       3        0             0 sshd
Feb 12 19:14:36 dev kernel: [ 3722]  1001  3722     4333      555      14       3        0             0 bash
Feb 12 19:14:36 dev kernel: [ 3775]  1001  3775    14124      166      32       3        0             0 top
Feb 12 19:14:36 dev kernel: [ 4001]    48  4001    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [ 4088]    48  4088    72026     1350     140       3        0             0 httpd
Feb 12 19:14:36 dev kernel: [ 5344]  1001  5344    48752     5025      97       3        0             0 pulp-admin
Feb 12 19:14:36 dev kernel: Out of memory: Kill process 2099 (celery) score 447 or sacrifice child
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Task pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd] raised unexpected: WorkerLostError('Worker exited prematurel
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Traceback (most recent call last):
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776)   File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776)     human_status(exitcode)),
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forkin
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)   "MongoClient opened before fork. Create MongoClient "
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)
Feb 12 19:14:36 dev celery[20561]: reserved_resource_worker-0@dev ready.
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5] succeeded in 0.351034021005s: None
Feb 12 19:31:16 dev sudo[6302]:  vagrant : TTY=pts/0 ; PWD=/home/vagrant ; USER=root ; COMMAND=/bin/systemctl status pulp_worker-0

The celery parent process is still running after the OOM:

$ sudo systemctl status pulp_worker-0 
● pulp_worker-0.service - Pulp Worker #0
   Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2016-02-11 22:43:22 UTC; 20h ago
 Main PID: 20561 (celery)
   CGroup: /system.slice/pulp_worker-0.service
           ├─ 5369 /bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
           └─20561 /bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30

Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Task pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd] raised unexpected: WorkerLostError('Wor...al 9 (SIGKILL).',)
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Traceback (most recent call last):
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776)   File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776)     human_status(exitcode)),
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after for...
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)   "MongoClient opened before fork. Create MongoClient "
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)
Feb 12 19:14:36 dev celery[20561]: reserved_resource_worker-0@dev ready.
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5] succeeded in 0.351034021005s: None
Hint: Some lines were ellipsized, use -l to show in full.

And here you can see that Pulp still thinks the task is running:

$ pulp-admin tasks list
+----------------------------------------------------------------------+
                                 Tasks
+----------------------------------------------------------------------+
Operations:  associate
Resources:   ohno (repository), rhel7 (repository)
State:       Running
Start Time:  2016-02-12T19:14:10Z
Finish Time: Incomplete
Task Id:     6785cda1-4237-4ae7-aad2-313e8d3b77dd

My theory is that Celery's parent process noticed that its child died and automatically started a new one to replace it for us. Since our worker watcher just makes sure that worker 0 is hearbeating on an interval, it is none the wiser that the process has been replaced. Unfortunately, nothing seems to notice that the task was lost and thus it will remain in the running state forever, despite being lost. Unfortunately, I don't have a proposal for what we could do differently as I need to fix my original issue at the moment.

As for triage, this is probably an issue we've had for all releases 2.4+, so I don't think it's a 2.8.0 blocker.


Related issues

Related to Pulp - Issue #2045: Task stuck at waiting if child process segfaultsCLOSED - DUPLICATE<a title="Actions" class="icon-only icon-actions js-contextmenu" href="#">Actions</a>

Associated revisions

Revision 1fe832ce View on GitHub
Added by bmbouter almost 5 years ago

_release_resource cleans up killed worker tasks

If a child Celery process is killed, Celery would recover but the TaskStatus would stay at running forever. This adds an additional check when the resource lock is removed which handles cleanup for any worker being killed.

Because the _release_resource is being run, we know the task itself has exited, yet if the TaskStatus is not at a final state, the worker certainly errored.

https://pulp.plan.io/issues/1673 fixes #1673

Revision 1fe832ce View on GitHub
Added by bmbouter almost 5 years ago

_release_resource cleans up killed worker tasks

If a child Celery process is killed, Celery would recover but the TaskStatus would stay at running forever. This adds an additional check when the resource lock is removed which handles cleanup for any worker being killed.

Because the _release_resource is being run, we know the task itself has exited, yet if the TaskStatus is not at a final state, the worker certainly errored.

https://pulp.plan.io/issues/1673 fixes #1673

History

#1 Updated by rbarlow over 5 years ago

One potential fix may be to log the PIDs of the child processes and notice when they change. Of course, there is a small possibility of it being replaced with the same PID as the previous process, so it's not a bullet proof idea, but would be easy to implement and better than what we have now.

#2 Updated by bmbouter over 5 years ago

@rbarlow, When the parent process spawns a new task to replace the one that was killed with the OOM, when the new task completes I expect the TaskStatus object will be updated to finished. Perhaps in your case since your task is deterministicly being killed by the OOM, your specific install won't leave the running state. Did your newly spawned task complete?

#3 Updated by rbarlow over 5 years ago

Hey Brian,

I do not believe there ever is a newly spawned task. I let it run for ~15 minutes and the system sat idle the whole time. It seems that the task is lost.

#4 Updated by bmbouter over 5 years ago

We should look into the expected celery behavior here. I asked in #celery about this, but I haven't heard back. We could also test it ourselves to confirm that the second task is not spawned. Based on your observation it sounds like it is not.

#5 Updated by mhrivnak over 5 years ago

  • Priority changed from Normal to High
  • Severity changed from 2. Medium to 3. High
  • Triaged changed from No to Yes

#6 Updated by bmbouter almost 5 years ago

  • Related to Issue #2045: Task stuck at waiting if child process segfaults added

#7 Updated by bmbouter almost 5 years ago

I've outlined 3 possible solutions to this bug when I filed a duplicate on it. See the possible solutions here: https://pulp.plan.io/issues/2045#note-1

#8 Updated by bmbouter almost 5 years ago

  • Status changed from NEW to ASSIGNED
  • Assignee set to bmbouter
  • Sprint/Milestone set to 23

#9 Updated by bmbouter almost 5 years ago

After some discussion in #celery, there is not a signal that allows us to execute custom logic when a WorkerLostError is raised. Celery does its handling of this error here [0]. We don't have an opportunity to override that without monkey patching or subclassing objects and convincing Celery to use ours instead.

It's likely the best that the release_resource checks that the TaskStatus is at a final state and if not mark it as canceled.

[0]: https://github.com/celery/celery/blob/f9380fd8a5c9b646067436a858ce14995eab6ebc/celery/worker/request.py#L349-L381

#11 Updated by bmbouter almost 5 years ago

  • Status changed from ASSIGNED to POST
  • Platform Release set to 2.9.1

#12 Updated by bmbouter almost 5 years ago

  • Status changed from POST to MODIFIED
  • % Done changed from 0 to 100

#13 Updated by semyers almost 5 years ago

  • Status changed from MODIFIED to CLOSED - CURRENTRELEASE

#14 Updated by bmbouter over 3 years ago

  • Sprint set to Sprint 5

#15 Updated by bmbouter over 3 years ago

  • Sprint/Milestone deleted (23)

#16 Updated by bmbouter about 2 years ago

  • Tags Pulp 2 added

Please register to edit this issue

Also available in: Atom PDF