Issue #1673
closedPulp's worker watcher does not notice workers that got killed by OOM killer and their tasks stay "running" forever
Description
In investing #1662 I had a worker get killed by the OOM killer. After this happened, the worker watcher did not become aware and so did not cancel the task that was now lost. Thus, the task stayed in a running state forever. Here is the log output of the worker getting killed by OOM killer:
Feb 12 19:14:10 dev pulp[20553]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._queue_reserved_task[20f6a5df-babd-4d2c-a0b1-2df849dd9414]
Feb 12 19:14:10 dev pulp[20561]: celery.worker.strategy:INFO: Received task: pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd]
Feb 12 19:14:10 dev pulp[20561]: celery.worker.strategy:INFO: Received task: pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5]
Feb 12 19:14:10 dev pulp[20553]: celery.worker.job:INFO: Task pulp.server.async.tasks._queue_reserved_task[20f6a5df-babd-4d2c-a0b1-2df849dd9414] succeeded in 0.0379113480012s: None
Feb 12 19:14:36 dev kernel: celery invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 12 19:14:36 dev kernel: celery cpuset=/ mems_allowed=0
Feb 12 19:14:36 dev kernel: CPU: 0 PID: 20751 Comm: celery Not tainted 4.2.3-300.fc23.x86_64 #1
Feb 12 19:14:36 dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
Feb 12 19:14:36 dev kernel: 0000000000000000 000000004e0abf34 ffff88001a627948 ffffffff81771fca
Feb 12 19:14:36 dev kernel: 0000000000000000 ffff880075ba9c40 ffff88001a6279d8 ffffffff8177150e
Feb 12 19:14:36 dev kernel: 65a4d4fb1a6279a8 fffeefff00000000 0000000000000002 000000000209e203
Feb 12 19:14:36 dev kernel: Call Trace:
Feb 12 19:14:36 dev kernel: [<ffffffff81771fca>] dump_stack+0x45/0x57
Feb 12 19:14:36 dev kernel: [<ffffffff8177150e>] dump_header+0x86/0x207
Feb 12 19:14:36 dev kernel: [<ffffffff811a627b>] oom_kill_process+0x1db/0x3a0
Feb 12 19:14:36 dev kernel: [<ffffffff811a69ec>] out_of_memory+0x54c/0x5a0
Feb 12 19:14:36 dev kernel: [<ffffffff811ac9d8>] __alloc_pages_nodemask+0x838/0x980
Feb 12 19:14:36 dev kernel: [<ffffffff811f4f71>] alloc_pages_current+0x91/0x100
Feb 12 19:14:36 dev kernel: [<ffffffff811a2a0b>] __page_cache_alloc+0xab/0xc0
Feb 12 19:14:36 dev kernel: [<ffffffff811a4c34>] filemap_fault+0x154/0x410
Feb 12 19:14:36 dev kernel: [<ffffffff811d10de>] __do_fault+0x4e/0xf0
Feb 12 19:14:36 dev kernel: [<ffffffff811d6218>] handle_mm_fault+0xf58/0x17d0
Feb 12 19:14:36 dev kernel: [<ffffffff81643efb>] ? sock_recvmsg+0x3b/0x50
Feb 12 19:14:36 dev kernel: [<ffffffff81065447>] __do_page_fault+0x197/0x400
Feb 12 19:14:36 dev kernel: [<ffffffff81065777>] trace_do_page_fault+0x47/0x100
Feb 12 19:14:36 dev kernel: [<ffffffff8105fd2e>] do_async_page_fault+0x1e/0xc0
Feb 12 19:14:36 dev kernel: [<ffffffff8177a968>] async_page_fault+0x28/0x30
Feb 12 19:14:36 dev kernel: Mem-Info:
Feb 12 19:14:36 dev kernel: active_anon:477508 inactive_anon:408 isolated_anon:0
active_file:34 inactive_file:0 isolated_file:17
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:6716 slab_unreclaimable:9582
mapped:353 shmem:479 pagetables:8316 bounce:0
free:3374 free_pcp:35 free_cma:0
Feb 12 19:14:36 dev kernel: Node 0 DMA free:7968kB min:40kB low:48kB high:60kB active_anon:6540kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB ml
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 1982 1982 1982
Feb 12 19:14:36 dev kernel: Node 0 DMA32 free:5528kB min:5580kB low:6972kB high:8368kB active_anon:1903492kB inactive_anon:1624kB active_file:136kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 0 0 0
Feb 12 19:14:36 dev kernel: Node 0 DMA: 14*4kB (UEM) 7*8kB (UEM) 29*16kB (E) 21*32kB (EM) 9*64kB (UEM) 4*128kB (UM) 2*256kB (UM) 2*512kB (UM) 0*1024kB 2*2048kB (UM) 0*4096kB = 7968kB
Feb 12 19:14:36 dev kernel: Node 0 DMA32: 1411*4kB (UM) 6*8kB (M) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5708kB
Feb 12 19:14:36 dev kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 12 19:14:36 dev kernel: 539 total pagecache pages
Feb 12 19:14:36 dev kernel: 0 pages in swap cache
Feb 12 19:14:36 dev kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 12 19:14:36 dev kernel: Free swap = 0kB
Feb 12 19:14:36 dev kernel: Total swap = 0kB
Feb 12 19:14:36 dev kernel: 524158 pages RAM
Feb 12 19:14:36 dev kernel: 0 pages HighMem/MovableOnly
Feb 12 19:14:36 dev kernel: 11910 pages reserved
Feb 12 19:14:36 dev kernel: 0 pages hwpoisoned
Feb 12 19:14:36 dev kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Feb 12 19:14:36 dev kernel: [ 374] 0 374 13724 95 31 4 0 0 systemd-journal
Feb 12 19:14:36 dev kernel: [ 444] 0 444 6084 78 16 3 0 0 systemd-logind
Feb 12 19:14:36 dev kernel: [ 473] 81 473 32144 165 33 3 0 -900 dbus-daemon
Feb 12 19:14:36 dev kernel: [ 530] 0 530 2169 32 10 3 0 0 agetty
Feb 12 19:14:36 dev kernel: [ 533] 0 533 2169 29 9 3 0 0 agetty
Feb 12 19:14:36 dev kernel: [ 708] 0 708 23354 3084 45 3 0 0 dhclient
Feb 12 19:14:36 dev kernel: [ 2320] 0 2320 25094 104 38 3 0 0 gssproxy
Feb 12 19:14:36 dev kernel: [ 2445] 32 2445 16464 145 35 3 0 0 rpcbind
Feb 12 19:14:36 dev kernel: [ 2447] 0 2447 4291 45 14 3 0 0 rpc.idmapd
Feb 12 19:14:36 dev kernel: [ 2450] 29 2450 10605 208 26 3 0 0 rpc.statd
Feb 12 19:14:36 dev kernel: [ 2451] 0 2451 10652 183 24 3 0 0 rpc.mountd
Feb 12 19:14:36 dev kernel: [ 3468] 0 3468 5704 151 17 3 0 0 crond
Feb 12 19:14:36 dev kernel: [ 3644] 0 3644 11250 364 24 3 0 -1000 systemd-udevd
Feb 12 19:14:36 dev kernel: [ 4465] 0 4465 12268 92 25 3 0 -1000 auditd
Feb 12 19:14:36 dev kernel: [16458] 994 16458 124840 1169 76 3 0 0 qpidd
Feb 12 19:14:36 dev kernel: [18225] 23 18225 21336 894 39 3 0 0 squid
Feb 12 19:14:36 dev kernel: [18227] 23 18227 24838 2670 47 3 0 0 squid
Feb 12 19:14:36 dev kernel: [18229] 23 18229 6992 52 19 3 0 0 log_file_daemon
Feb 12 19:14:36 dev kernel: [18356] 184 18356 616450 14553 555 5 0 0 mongod
Feb 12 19:14:36 dev kernel: [18488] 0 18488 20354 207 41 3 0 -1000 sshd
Feb 12 19:14:36 dev kernel: [20551] 0 20551 313477 4014 103 4 0 0 python
Feb 12 19:14:36 dev kernel: [20552] 48 20552 127793 14594 183 4 0 0 pulp_streamer
Feb 12 19:14:36 dev kernel: [20553] 48 20553 144600 14097 181 3 0 0 celery
Feb 12 19:14:36 dev kernel: [20554] 48 20554 144116 8929 114 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20556] 0 20556 68831 798 137 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20561] 48 20561 144601 14084 183 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20563] 48 20563 144601 14088 183 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20565] 48 20565 144600 14076 183 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20571] 48 20571 144600 14082 180 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20577] 48 20577 292827 14987 294 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20578] 48 20578 292828 14691 293 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20580] 48 20580 292956 14886 293 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20581] 48 20581 140971 5912 194 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20582] 48 20582 140971 5912 194 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20583] 48 20583 157355 5912 195 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20584] 48 20584 221349 13428 282 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20585] 48 20585 237734 13428 283 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20586] 48 20586 221349 13379 282 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20587] 48 20587 72028 1364 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20588] 48 20588 72051 1395 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20590] 48 20590 72026 1363 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20593] 48 20593 72028 1364 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20595] 48 20595 72026 1363 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20807] 48 20807 72026 1350 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20853] 48 20853 161770 13916 174 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20854] 48 20854 145386 13916 173 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20856] 48 20856 162967 14063 176 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20857] 48 20857 403431 228560 605 5 0 0 celery
Feb 12 19:14:36 dev kernel: [20858] 48 20858 144541 14014 171 3 0 0 celery
Feb 12 19:14:36 dev kernel: [ 3128] 0 3128 37869 317 77 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3131] 1001 3131 11238 142 27 3 0 0 systemd
Feb 12 19:14:36 dev kernel: [ 3133] 1001 3133 43881 688 49 3 0 0 (sd-pam)
Feb 12 19:14:36 dev kernel: [ 3136] 1001 3136 37869 315 75 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3137] 1001 3137 4333 578 14 3 0 0 bash
Feb 12 19:14:36 dev kernel: [ 3443] 0 3443 37869 315 76 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3447] 1001 3447 37869 367 75 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3448] 1001 3448 4333 547 14 3 0 0 bash
Feb 12 19:14:36 dev kernel: [ 3500] 0 3500 30367 216 62 3 0 0 sudo
Feb 12 19:14:36 dev kernel: [ 3501] 0 3501 13731 59 27 3 0 0 journalctl
Feb 12 19:14:36 dev kernel: [ 3565] 0 3565 37869 315 77 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3569] 1001 3569 37869 314 74 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3570] 1001 3570 4333 546 13 3 0 0 bash
Feb 12 19:14:36 dev kernel: [ 3620] 1001 3620 13060 1084 31 3 0 0 dstat
Feb 12 19:14:36 dev kernel: [ 3717] 0 3717 37869 316 78 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3721] 1001 3721 37869 315 75 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3722] 1001 3722 4333 555 14 3 0 0 bash
Feb 12 19:14:36 dev kernel: [ 3775] 1001 3775 14124 166 32 3 0 0 top
Feb 12 19:14:36 dev kernel: [ 4001] 48 4001 72026 1350 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [ 4088] 48 4088 72026 1350 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [ 5344] 1001 5344 48752 5025 97 3 0 0 pulp-admin
Feb 12 19:14:36 dev kernel: Out of memory: Kill process 20857 (celery) score 447 or sacrifice child
Feb 12 19:14:36 dev kernel: Killed process 20857 (celery) total-vm:1613724kB, anon-rss:914236kB, file-rss:4kB
Feb 12 19:14:36 dev kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 12 19:14:36 dev kernel: httpd cpuset=/ mems_allowed=0
Feb 12 19:14:36 dev kernel: CPU: 2 PID: 20771 Comm: httpd Not tainted 4.2.3-300.fc23.x86_64 #1
Feb 12 19:14:36 dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
Feb 12 19:14:36 dev kernel: 0000000000000000 00000000c5cb6a30 ffff88000d8c7948 ffffffff81771fca
Feb 12 19:14:36 dev kernel: 0000000000000000 ffff88000f3e9c40 ffff88000d8c79d8 ffffffff8177150e
Feb 12 19:14:36 dev kernel: 65a4d4fb810b5ae5 fffeefff00000000 0000000000000002 000000000209e303
Feb 12 19:14:36 dev kernel: Call Trace:
Feb 12 19:14:36 dev kernel: [<ffffffff81771fca>] dump_stack+0x45/0x57
Feb 12 19:14:36 dev kernel: [<ffffffff8177150e>] dump_header+0x86/0x207
Feb 12 19:14:36 dev kernel: [<ffffffff811a627b>] oom_kill_process+0x1db/0x3a0
Feb 12 19:14:36 dev kernel: [<ffffffff811a69ec>] out_of_memory+0x54c/0x5a0
Feb 12 19:14:36 dev kernel: [<ffffffff811ac9d8>] __alloc_pages_nodemask+0x838/0x980
Feb 12 19:14:36 dev kernel: [<ffffffff811f4f71>] alloc_pages_current+0x91/0x100
Feb 12 19:14:36 dev kernel: [<ffffffff811a2a0b>] __page_cache_alloc+0xab/0xc0
Feb 12 19:14:36 dev kernel: [<ffffffff811a4c34>] filemap_fault+0x154/0x410
Feb 12 19:14:36 dev kernel: [<ffffffff811d10de>] __do_fault+0x4e/0xf0
Feb 12 19:14:36 dev kernel: [<ffffffff811d6218>] handle_mm_fault+0xf58/0x17d0
Feb 12 19:14:36 dev kernel: [<ffffffff8111b8a2>] ? do_futex+0x112/0xb00
Feb 12 19:14:36 dev kernel: [<ffffffff81065447>] __do_page_fault+0x197/0x400
Feb 12 19:14:36 dev kernel: [<ffffffff81065777>] trace_do_page_fault+0x47/0x100
Feb 12 19:14:36 dev kernel: [<ffffffff8105fd2e>] do_async_page_fault+0x1e/0xc0
Feb 12 19:14:36 dev kernel: [<ffffffff8177a968>] async_page_fault+0x28/0x30
Feb 12 19:14:36 dev kernel: Mem-Info:
Feb 12 19:14:36 dev kernel: active_anon:477500 inactive_anon:408 isolated_anon:0
active_file:31 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:6690 slab_unreclaimable:9554
mapped:291 shmem:479 pagetables:8316 bounce:0
free:3361 free_pcp:64 free_cma:0
Feb 12 19:14:36 dev kernel: Node 0 DMA free:7968kB min:40kB low:48kB high:60kB active_anon:6540kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB ml
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 1982 1982 1982
Feb 12 19:14:36 dev kernel: Node 0 DMA32 free:82140kB min:5580kB low:6972kB high:8368kB active_anon:1825712kB inactive_anon:1624kB active_file:124kB inactive_file:332kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640
Feb 12 19:14:36 dev kernel: lowmem_reserve[]: 0 0 0 0
Feb 12 19:14:36 dev kernel: Node 0 DMA: 63*4kB (UEM) 15*8kB (UEM) 29*16kB (E) 21*32kB (EM) 9*64kB (UEM) 4*128kB (UM) 2*256kB (UM) 2*512kB (UM) 0*1024kB 2*2048kB (UM) 0*4096kB = 8228kB
Feb 12 19:14:36 dev kernel: Node 0 DMA32: 15988*4kB (UEM) 4728*8kB (UEM) 992*16kB (UEM) 144*32kB (UEM) 17*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 123472kB
Feb 12 19:14:36 dev kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Feb 12 19:14:36 dev kernel: 945 total pagecache pages
Feb 12 19:14:36 dev kernel: 0 pages in swap cache
Feb 12 19:14:36 dev kernel: Swap cache stats: add 0, delete 0, find 0/0
Feb 12 19:14:36 dev kernel: Free swap = 0kB
Feb 12 19:14:36 dev kernel: Total swap = 0kB
Feb 12 19:14:36 dev kernel: 524158 pages RAM
Feb 12 19:14:36 dev kernel: 0 pages HighMem/MovableOnly
Feb 12 19:14:36 dev kernel: 11910 pages reserved
Feb 12 19:14:36 dev kernel: 0 pages hwpoisoned
Feb 12 19:14:36 dev kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Feb 12 19:14:36 dev kernel: [ 374] 0 374 13724 95 31 4 0 0 systemd-journal
Feb 12 19:14:36 dev kernel: [ 444] 0 444 6084 78 16 3 0 0 systemd-logind
Feb 12 19:14:36 dev kernel: [ 473] 81 473 32144 165 33 3 0 -900 dbus-daemon
Feb 12 19:14:36 dev kernel: [ 530] 0 530 2169 32 10 3 0 0 agetty
Feb 12 19:14:36 dev kernel: [ 533] 0 533 2169 29 9 3 0 0 agetty
Feb 12 19:14:36 dev kernel: [ 708] 0 708 23354 3084 45 3 0 0 dhclient
Feb 12 19:14:36 dev kernel: [ 2320] 0 2320 25094 104 38 3 0 0 gssproxy
Feb 12 19:14:36 dev kernel: [ 2445] 32 2445 16464 145 35 3 0 0 rpcbind
Feb 12 19:14:36 dev kernel: [ 2447] 0 2447 4291 45 14 3 0 0 rpc.idmapd
Feb 12 19:14:36 dev kernel: [ 2450] 29 2450 10605 208 26 3 0 0 rpc.statd
Feb 12 19:14:36 dev kernel: [ 2451] 0 2451 10652 183 24 3 0 0 rpc.mountd
Feb 12 19:14:36 dev kernel: [ 3468] 0 3468 5704 151 17 3 0 0 crond
Feb 12 19:14:36 dev kernel: [ 3644] 0 3644 11250 364 24 3 0 -1000 systemd-udevd
Feb 12 19:14:36 dev kernel: [ 4465] 0 4465 12268 92 25 3 0 -1000 auditd
Feb 12 19:14:36 dev kernel: [16458] 994 16458 124840 1169 76 3 0 0 qpidd
Feb 12 19:14:36 dev kernel: [18225] 23 18225 21336 894 39 3 0 0 squid
Feb 12 19:14:36 dev kernel: [18227] 23 18227 24838 2670 47 3 0 0 squid
Feb 12 19:14:36 dev kernel: [18229] 23 18229 6992 52 19 3 0 0 log_file_daemon
Feb 12 19:14:36 dev kernel: [18356] 184 18356 616450 14553 555 5 0 0 mongod
Feb 12 19:14:36 dev kernel: [18488] 0 18488 20354 207 41 3 0 -1000 sshd
Feb 12 19:14:36 dev kernel: [20551] 0 20551 313477 4014 103 4 0 0 python
Feb 12 19:14:36 dev kernel: [20552] 48 20552 127793 14593 183 4 0 0 pulp_streamer
Feb 12 19:14:36 dev kernel: [20553] 48 20553 144600 14097 181 3 0 0 celery
Feb 12 19:14:36 dev kernel: [20554] 48 20554 144116 8929 114 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20556] 0 20556 68831 798 137 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20561] 48 20561 144601 14084 183 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20563] 48 20563 144601 14088 183 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20565] 48 20565 144600 14076 183 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20571] 48 20571 144600 14082 180 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20577] 48 20577 292827 14986 294 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20578] 48 20578 292828 14690 293 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20580] 48 20580 292956 15204 293 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20581] 48 20581 140971 5911 194 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20582] 48 20582 140971 5912 194 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20583] 48 20583 157355 5911 195 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20584] 48 20584 221349 13427 282 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20585] 48 20585 237734 13427 283 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20586] 48 20586 221349 13378 282 4 0 0 httpd
Feb 12 19:14:36 dev kernel: [20587] 48 20587 72028 1364 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20588] 48 20588 72051 1395 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20590] 48 20590 72026 1363 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20593] 48 20593 72028 1364 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20595] 48 20595 72026 1363 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20807] 48 20807 72026 1350 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [20853] 48 20853 161770 13916 174 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20854] 48 20854 145386 13916 173 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20856] 48 20856 162967 14063 176 4 0 0 celery
Feb 12 19:14:36 dev kernel: [20858] 48 20858 144541 14014 171 3 0 0 celery
Feb 12 19:14:36 dev kernel: [ 3128] 0 3128 37869 317 77 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3131] 1001 3131 11238 142 27 3 0 0 systemd
Feb 12 19:14:36 dev kernel: [ 3133] 1001 3133 43881 688 49 3 0 0 (sd-pam)
Feb 12 19:14:36 dev kernel: [ 3136] 1001 3136 37869 315 75 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3137] 1001 3137 4333 578 14 3 0 0 bash
Feb 12 19:14:36 dev kernel: [ 3443] 0 3443 37869 315 76 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3447] 1001 3447 37869 367 75 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3448] 1001 3448 4333 547 14 3 0 0 bash
Feb 12 19:14:36 dev kernel: [ 3500] 0 3500 30367 216 62 3 0 0 sudo
Feb 12 19:14:36 dev kernel: [ 3501] 0 3501 13731 59 27 3 0 0 journalctl
Feb 12 19:14:36 dev kernel: [ 3565] 0 3565 37869 315 77 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3569] 1001 3569 37869 314 74 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3570] 1001 3570 4333 546 13 3 0 0 bash
Feb 12 19:14:36 dev kernel: [ 3620] 1001 3620 13060 1084 31 3 0 0 dstat
Feb 12 19:14:36 dev kernel: [ 3717] 0 3717 37869 316 78 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3721] 1001 3721 37869 315 75 3 0 0 sshd
Feb 12 19:14:36 dev kernel: [ 3722] 1001 3722 4333 555 14 3 0 0 bash
Feb 12 19:14:36 dev kernel: [ 3775] 1001 3775 14124 166 32 3 0 0 top
Feb 12 19:14:36 dev kernel: [ 4001] 48 4001 72026 1350 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [ 4088] 48 4088 72026 1350 140 3 0 0 httpd
Feb 12 19:14:36 dev kernel: [ 5344] 1001 5344 48752 5025 97 3 0 0 pulp-admin
Feb 12 19:14:36 dev kernel: Out of memory: Kill process 2099 (celery) score 447 or sacrifice child
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Task pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd] raised unexpected: WorkerLostError('Worker exited prematurel
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Traceback (most recent call last):
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) human_status(exitcode)),
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forkin
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776) "MongoClient opened before fork. Create MongoClient "
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)
Feb 12 19:14:36 dev celery[20561]: reserved_resource_worker-0@dev ready.
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5] succeeded in 0.351034021005s: None
Feb 12 19:31:16 dev sudo[6302]: vagrant : TTY=pts/0 ; PWD=/home/vagrant ; USER=root ; COMMAND=/bin/systemctl status pulp_worker-0
The celery parent process is still running after the OOM:
$ sudo systemctl status pulp_worker-0
● pulp_worker-0.service - Pulp Worker #0
Loaded: loaded (/run/systemd/system/pulp_worker-0.service; static; vendor preset: disabled)
Active: active (running) since Thu 2016-02-11 22:43:22 UTC; 20h ago
Main PID: 20561 (celery)
CGroup: /system.slice/pulp_worker-0.service
├─ 5369 /bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
└─20561 /bin/python /usr/bin/celery worker -n reserved_resource_worker-0@%h -A pulp.server.async.app -c 1 --events --umask 18 --pidfile=/var/run/pulp/reserved_resource_worker-0.pid --heartbeat-interval=30
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Task pulp.server.managers.repo.unit_association.associate_from_repo[6785cda1-4237-4ae7-aad2-313e8d3b77dd] raised unexpected: WorkerLostError('Wor...al 9 (SIGKILL).',)
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) Traceback (most recent call last):
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) File "/usr/lib64/python2.7/site-packages/billiard/pool.py", line 1171, in mark_as_worker_lost
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) human_status(exitcode)),
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:ERROR: (20561-35776) WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776) /usr/lib64/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after for...
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776) "MongoClient opened before fork. Create MongoClient "
Feb 12 19:14:36 dev pulp[5369]: py.warnings:WARNING: (5369-35776)
Feb 12 19:14:36 dev celery[20561]: reserved_resource_worker-0@dev ready.
Feb 12 19:14:36 dev pulp[20561]: celery.worker.job:INFO: Task pulp.server.async.tasks._release_resource[de659087-c799-4365-ae5f-2957caf102b5] succeeded in 0.351034021005s: None
Hint: Some lines were ellipsized, use -l to show in full.
And here you can see that Pulp still thinks the task is running:
$ pulp-admin tasks list
+----------------------------------------------------------------------+
Tasks
+----------------------------------------------------------------------+
Operations: associate
Resources: ohno (repository), rhel7 (repository)
State: Running
Start Time: 2016-02-12T19:14:10Z
Finish Time: Incomplete
Task Id: 6785cda1-4237-4ae7-aad2-313e8d3b77dd
My theory is that Celery's parent process noticed that its child died and automatically started a new one to replace it for us. Since our worker watcher just makes sure that worker 0 is hearbeating on an interval, it is none the wiser that the process has been replaced. Unfortunately, nothing seems to notice that the task was lost and thus it will remain in the running state forever, despite being lost. Unfortunately, I don't have a proposal for what we could do differently as I need to fix my original issue at the moment.
As for triage, this is probably an issue we've had for all releases 2.4+, so I don't think it's a 2.8.0 blocker.
Related issues
_release_resource cleans up killed worker tasks
If a child Celery process is killed, Celery would recover but the TaskStatus would stay at running forever. This adds an additional check when the resource lock is removed which handles cleanup for any worker being killed.
Because the _release_resource is being run, we know the task itself has exited, yet if the TaskStatus is not at a final state, the worker certainly errored.
https://pulp.plan.io/issues/1673 fixes #1673