IMPALA-5297

free-pool-test may be OOM killed on jenkins.impala.io runs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Infrastructure
    • Labels:
      None

      Description

      On gerrit-verify-dryrun jobs, attempting to submit a change that updates the Kudu version seems to cause free-pool-test to run out of memory.

      free-pool-test makes some large allocations (I think around 7 GB in total), and with other processes running at the same time, the gerrit jobs may be getting close to the 15 GB CommitLimit on these AWS hosts.

      Here's the output from kern.log:

      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153878] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153882] java cpuset=/ mems_allowed=0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153884] CPU: 1 PID: 19555 Comm: java Not tainted 3.13.0-100-generic #147-Ubuntu
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153886] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153887]  0000000000000000 ffff88066708f970 ffffffff8172a4bb ffff88047b3b1800
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153891]  0000000000000000 ffff88066708f9f8 ffffffff81724a5a 0000000000000000
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153894]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153897] Call Trace:
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153904]  [<ffffffff8172a4bb>] dump_stack+0x64/0x82
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153908]  [<ffffffff81724a5a>] dump_header+0x7f/0x1f1
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153912]  [<ffffffff81155d11>] oom_kill_process+0x201/0x360
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153917]  [<ffffffff812dcab5>] ? security_capable_noaudit+0x15/0x20
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153919]  [<ffffffff811564a1>] out_of_memory+0x471/0x4b0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153922]  [<ffffffff8115c7bc>] __alloc_pages_nodemask+0xa6c/0xb90
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153926]  [<ffffffff8119ae83>] alloc_pages_current+0xa3/0x160
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153930]  [<ffffffff811527c7>] __page_cache_alloc+0x97/0xc0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153932]  [<ffffffff81154235>] filemap_fault+0x185/0x410
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153936]  [<ffffffff8117944f>] __do_fault+0x6f/0x530
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153941]  [<ffffffff810135db>] ? __switch_to+0x16b/0x4f0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153943]  [<ffffffff8117d2a2>] handle_mm_fault+0x482/0xf00
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153947]  [<ffffffff81090df7>] ? hrtimer_try_to_cancel+0x47/0x100
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153950]  [<ffffffff8172df0e>] ? schedule_hrtimeout_range_clock+0xce/0x170
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153954]  [<ffffffff81736644>] __do_page_fault+0x184/0x560
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153957]  [<ffffffff8120a45f>] ? ep_poll+0x2ff/0x330
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153961]  [<ffffffff8109d2f0>] ? wake_up_state+0x20/0x20
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153964]  [<ffffffff81736a3a>] do_page_fault+0x1a/0x70
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153966]  [<ffffffff8120b5cc>] ? SyS_epoll_wait+0xac/0x100
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153968]  [<ffffffff81732d68>] page_fault+0x28/0x30
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153970] Mem-Info:
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153971] Node 0 DMA per-cpu:
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153973] CPU    0: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153974] CPU    1: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153975] CPU    2: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153976] CPU    3: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153977] CPU    4: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153978] CPU    5: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153979] CPU    6: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153980] CPU    7: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153981] CPU    8: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153982] CPU    9: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153984] CPU   10: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153985] CPU   11: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153986] CPU   12: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153987] CPU   13: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153988] CPU   14: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153989] CPU   15: hi:    0, btch:   1 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153990] Node 0 DMA32 per-cpu:
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153991] CPU    0: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153992] CPU    1: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153993] CPU    2: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153995] CPU    3: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153996] CPU    4: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153997] CPU    5: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153998] CPU    6: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.153999] CPU    7: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154001] CPU    8: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154002] CPU    9: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154003] CPU   10: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154004] CPU   11: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154005] CPU   12: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154006] CPU   13: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154007] CPU   14: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154008] CPU   15: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154009] Node 0 Normal per-cpu:
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154010] CPU    0: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154011] CPU    1: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154012] CPU    2: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154013] CPU    3: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154014] CPU    4: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154015] CPU    5: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154016] CPU    6: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154017] CPU    7: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154018] CPU    8: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154019] CPU    9: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154020] CPU   10: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154021] CPU   11: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154023] CPU   12: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154024] CPU   13: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154025] CPU   14: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154026] CPU   15: hi:  186, btch:  31 usd:   0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154028] active_anon:7546116 inactive_anon:3718 isolated_anon:0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154028]  active_file:405 inactive_file:19 isolated_file:0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154028]  unevictable:5 dirty:193 writeback:0 unstable:0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154028]  free:47219 slab_reclaimable:15089 slab_unreclaimable:22997
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154028]  mapped:6952 shmem:7403 pagetables:22940 bounce:0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154028]  free_cma:0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154031] Node 0 DMA free:15904kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154035] lowmem_reserve[]: 0 3744 30129 30129
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154038] Node 0 DMA32 free:113892kB min:8392kB low:10488kB high:12588kB active_anon:3689064kB inactive_anon:1468kB active_file:260kB inactive_file:20kB unevictable:4kB isolated(anon):0kB isolated(file):0kB present:3915776kB managed:3836720kB mlocked:4kB dirty:104kB writeback:0kB mapped:4468kB shmem:4496kB slab_reclaimable:6640kB slab_unreclaimable:9296kB kernel_stack:3832kB pagetables:10816kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:451 all_unreclaimable? yes
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154041] lowmem_reserve[]: 0 0 26385 26385
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154044] Node 0 Normal free:59080kB min:59152kB low:73940kB high:88728kB active_anon:26495400kB inactive_anon:13404kB active_file:1360kB inactive_file:56kB unevictable:16kB isolated(anon):0kB isolated(file):0kB present:27525120kB managed:27019008kB mlocked:16kB dirty:668kB writeback:0kB mapped:23340kB shmem:25116kB slab_reclaimable:53716kB slab_unreclaimable:82692kB kernel_stack:33392kB pagetables:80944kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2483 all_unreclaimable? yes
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154047] lowmem_reserve[]: 0 0 0 0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154049] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15904kB
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154058] Node 0 DMA32: 167*4kB (UEM) 1991*8kB (UEM) 590*16kB (UEM) 275*32kB (UEM) 204*64kB (UEM) 139*128kB (UEM) 66*256kB (UEM) 32*512kB (EM) 11*1024kB (UEM) 2*2048kB (EM) 0*4096kB = 114324kB
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154068] Node 0 Normal: 15098*4kB (UEM) 36*8kB (EM) 3*16kB (EM) 1*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 60760kB
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154076] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154077] 7651 total pagecache pages
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154078] 0 pages in swap cache
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154079] Swap cache stats: add 0, delete 0, find 0/0
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154080] Free swap  = 0kB
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154081] Total swap = 0kB
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154082] 7864221 pages RAM
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154083] 0 pages HighMem/MovableOnly
      May  9 18:32:31 ip-172-31-7-2 kernel: [ 7498.154084] 126528 pages reserved
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344897] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344918] [  740]     0   740     4868       49      13        0             0 upstart-udev-br
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344920] [  747]     0   747    12521      234      27        0         -1000 systemd-udevd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344922] [  912]     0   912     3814       51      12        0             0 upstart-socket-
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344924] [  960]     0   960     2554      574       8        0             0 dhclient
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344926] [ 1256]     0  1256     3818       55      13        0             0 upstart-file-br
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344928] [ 1402]     0  1402     3633       41      12        0             0 getty
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344930] [ 1405]     0  1405     3633       40      12        0             0 getty
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344932] [ 1407]   101  1407    65017      688      29        0             0 rsyslogd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344934] [ 1410]     0  1410     3633       42      12        0             0 getty
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344935] [ 1411]     0  1411     3633       40      12        0             0 getty
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344937] [ 1413]     0  1413     3633       39      10        0             0 getty
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344939] [ 1437]     0  1437    15344      172      34        0         -1000 sshd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344941] [ 1465]     0  1465     4783       40      13        0             0 atd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344943] [ 1466]     0  1466     5912       53      17        0             0 cron
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344944] [ 1481]     0  1481     1091       36       7        0             0 acpid
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344946] [ 1490]   102  1490     9802      100      24        0             0 dbus-daemon
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344948] [ 1501]     0  1501    10861       89      25        0             0 systemd-logind
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344950] [ 1507]     0  1507     4863      112      14        0             0 irqbalance
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344952] [ 1608]     0  1608    26411      252      54        0             0 sshd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344954] [ 1729]  1000  1729    26999      847      56        0             0 sshd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344956] [ 1936]     0  1936     3633       39      12        0             0 getty
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344957] [ 1937]     0  1937     3195       38      12        0             0 getty
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344959] [ 2445]   106  2445     7863      151      19        0             0 ntpd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344961] [ 3564]   108  3564    33045     1466      55        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344962] [ 3566]   108  3566    33073     5407      65        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344964] [ 3567]   108  3567    33045      331      54        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344966] [ 3568]   108  3568    33045      528      52        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344968] [ 3569]   108  3569    33253      534      54        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344969] [ 3570]   108  3570    25222      376      49        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344971] [ 3797]  1000  3797  2656388    54163     197        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344973] [ 3840]  1000  3840     2826       97       9        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344975] [14391]     0 14391    26410      245      53        0             0 sshd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344977] [14454]  1000 14454    26410      252      51        0             0 sshd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344979] [14455]  1000 14455     5660      837      16        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344981] [18767]  1000 18767   421689    75153     280        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344982] [18818]  1000 18818   427646    72071     276        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344984] [18845]  1000 18845   422808    76361     287        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344986] [18986]  1000 18986   413940    85847     272        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344988] [19231]  1000 19231   423114    68900     237        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344989] [19255]  1000 19255   422304    98381     297        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344991] [19281]  1000 19281   453916    50170     285        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344993] [19307]  1000 19307   421858    73572     251        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344994] [20004]  1000 20004  2625480    62614     227        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344996] [20059]  1000 20059  2417786   642515    1907        0             0 kudu-tserver
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344998] [20075]  1000 20075   170158     4632     136        0             0 kudu-master
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.344999] [20091]  1000 20091  2534506   681406    2028        0             0 kudu-tserver
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345001] [20100]  1000 20100  2480736   678210    1996        0             0 kudu-tserver
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345003] [21180]  1000 21180     2812       85      10        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345004] [21194]  1000 21194  2131760    52636     186        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345006] [21277]  1000 21277     3354      114      11        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345008] [21291]  1000 21291  2176412    96677     367        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345010] [21441]  1000 21441     3354      114      12        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345011] [21455]  1000 21455  2171189    84863     323        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345013] [21619]  1000 21619     3354      115      11        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345015] [21633]  1000 21633  2174403   126755     412        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345016] [21773]  1000 21773     3354      115      10        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345018] [21787]  1000 21787  2165621   105043     368        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345019] [22327]  1000 22327   699864    64647     359        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345021] [22650]   108 22650    34060     1947      62        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345023] [22651]   108 22651    34067     1837      61        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345025] [22695]   108 22695    34564     5014      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345026] [22696]   108 22696    34668     5491      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345028] [22701]  1000 22701   499743   165103     749        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345030] [22966]  1000 22966   404221    82336     258        0             0 java
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345031] [49266]   108 49266    34579     5136      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345033] [49267]   108 49267    34487     5004      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345034] [49434]   108 49434    34298     1980      61        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345036] [49435]   108 49435    34140     1836      60        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345037] [49438]   108 49438    34575     5159      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345039] [49439]   108 49439    34505     5146      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345041] [49441]   108 49441    34590     5236      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345042] [49442]   108 49442    34553     5045      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345044] [49486]   108 49486    34507     5106      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345046] [49487]   108 49487    34573     5240      67        0             0 postgres
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345048] [12270]  1000 12270     3345      105      11        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345049] [12634]  1000 12634   106243     2424     105        0             0 statestored
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345051] [12642]  1000 12642  2177880    69774     301        0             0 catalogd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345052] [12708]  1000 12708  2599753    71304     505        0             0 impalad
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345054] [12775]  1000 12775  2599480    69011     503        0             0 impalad
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345055] [12844]  1000 12844  2599354    70697     503        0             0 impalad
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345057] [13564]  1000 13564     3411      159      12        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345059] [13565]  1000 13565     2623      114      11        0             0 make
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345060] [13568]  1000 13568     6091      149      16        0             0 ctest
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345062] [70617]     0 70617    26410      246      55        0             0 sshd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345063] [71103]  1000 71103    26444      247      53        0             0 sshd
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345065] [71113]  1000 71113     5628      806      16        0             0 bash
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345066] [ 2745]  1000  2745  2286602   253086     810        0             0 buffered-tuple-
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345068] [ 3922]  1000  3921  5818436  3452952    6822        0             0 free-pool-test
      May  9 18:35:52 ip-172-31-7-2 kernel: [ 7699.345070] Out of memory: Kill process 3922 (free-pool-test) score 448 or sacrifice child
      

      and shortly after, the output of meminfo:

      ubuntu@ip-172-31-7-2:~/Impala/logs/be_tests$ cat /proc/meminfo 
      MemTotal:       30871632 kB
      MemFree:         7011044 kB
      Buffers:           40700 kB
      Cached:          5438488 kB
      SwapCached:            0 kB
      Active:         21124408 kB
      Inactive:        2125936 kB
      Active(anon):   17793440 kB
      Inactive(anon):     7516 kB
      Active(file):    3330968 kB
      Inactive(file):  2118420 kB
      Unevictable:          20 kB
      Mlocked:              20 kB
      SwapTotal:             0 kB
      SwapFree:              0 kB
      Dirty:              1132 kB
      Writeback:             0 kB
      AnonPages:      17781632 kB
      Mapped:           138604 kB
      Shmem:             29780 kB
      Slab:             276868 kB
      SReclaimable:     182520 kB
      SUnreclaim:        94348 kB
      KernelStack:       39152 kB
      PageTables:        66140 kB
      NFS_Unstable:          0 kB
      Bounce:                0 kB
      WritebackTmp:          0 kB
      CommitLimit:    15435816 kB
      Committed_AS:   49437784 kB
      VmallocTotal:   34359738367 kB
      VmallocUsed:       72396 kB
      VmallocChunk:   34359655000 kB
      HardwareCorrupted:     0 kB
      AnonHugePages:  15386624 kB
      HugePages_Total:       0
      HugePages_Free:        0
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      Hugepagesize:       2048 kB
      DirectMap4k:       38912 kB
      DirectMap2M:     3237888 kB
      DirectMap1G:    28311552 kB
      

      We should probably have larger VMs for these jobs, but may also need to consider reducing the mem needed for BE tests.
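
For scale, the rss column in the oom-killer task dump above is reported in pages, not kB. The sketch below converts a few rows copied from that dump into GiB, assuming 4 kB pages (typical for x86-64, but not stated in the log). The helper name `rss_gib_by_name` and the choice of rows are illustrative only.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, not stated in the log itself

# Rows copied from the task dump above (pid, uid, tgid, total_vm, rss, ...).
TASK_DUMP = """\
[20059]  1000 20059  2417786   642515    1907        0             0 kudu-tserver
[20091]  1000 20091  2534506   681406    2028        0             0 kudu-tserver
[20100]  1000 20100  2480736   678210    1996        0             0 kudu-tserver
[ 2745]  1000  2745  2286602   253086     810        0             0 buffered-tuple-
[ 3922]  1000  3921  5818436  3452952    6822        0             0 free-pool-test
"""

# [pid] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)"
)


def rss_gib_by_name(dump, page_kb=PAGE_KB):
    """Sum resident pages per process name and convert the totals to GiB."""
    pages = {}
    for line in dump.splitlines():
        match = ROW.search(line)
        if match is None:
            continue  # skip anything that is not a task row
        name = match.group(9)
        pages[name] = pages.get(name, 0) + int(match.group(5))
    return {name: count * page_kb / (1024 * 1024) for name, count in pages.items()}


for name, gib in sorted(rss_gib_by_name(TASK_DUMP).items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {gib:6.2f} GiB")
```

Under that assumption, free-pool-test was resident at roughly 13 GiB when it was killed, and the three kudu-tserver processes together held a little under 8 GiB.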

        Activity

        jbapple Jim Apple added a comment -

        The machines have 30GB of memory, since they are c4.4xlarge. I wonder why CommitLimit is so much lower?

        mjacobs Matthew Jacobs added a comment -

        According to https://access.redhat.com/solutions/665023:

        CommitLimit = ([total RAM pages] - [total huge TLB pages]) * overcommit_ratio / 100 + [total swap pages]

        ubuntu@ip-172-31-7-2:~$ cat /proc/sys/vm/overcommit_ratio
        50

        Perhaps we can set overcommit_ratio to 100?
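
As a quick check of the formula above, the following sketch plugs in the values from the /proc/meminfo output in the description. It works in kB rather than pages, which should be equivalent as long as every page has the same size. The function name `commit_limit_kb` is made up for illustration.

```python
def commit_limit_kb(mem_total_kb, hugetlb_kb, swap_total_kb, overcommit_ratio):
    """CommitLimit = (RAM - huge TLB pages) * overcommit_ratio / 100 + swap."""
    return (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_total_kb


# Values from the meminfo output in the description:
# MemTotal 30871632 kB, HugePages_Total 0, SwapTotal 0 kB.
MEM_TOTAL_KB = 30871632

print(commit_limit_kb(MEM_TOTAL_KB, 0, 0, 50))   # ratio currently set on the host
print(commit_limit_kb(MEM_TOTAL_KB, 0, 0, 100))  # proposed ratio
```

With overcommit_ratio at 50 this gives 15435816 kB, which matches the reported CommitLimit exactly. At 100 the limit would equal MemTotal, since the host has no swap and no huge TLB pages.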

        mikesbrown Michael Brown added a comment -

        If I read this correctly,

        MemTotal:       30871632 kB
        MemFree:         7011044 kB
        CommitLimit:    15435816 kB
        Committed_AS:   49437784 kB
        

        and these references correctly

        https://linux.die.net/man/5/proc (read about Committed_AS, Committed, overcommit_ratio, and overcommit_memory).
        https://serverfault.com/questions/362589/effects-of-configuring-vm-overcommit-memory

        It seems we could try:

        echo 100 > /proc/sys/vm/overcommit_ratio
        echo 2 > /proc/sys/vm/overcommit_memory
        

        However, MemFree is looking a little low. It's possible this is only a stopgap.

        Do we need to consider more memory on hosts, or less memory needed for backend tests?
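
One thing worth checking before switching to overcommit_memory=2 is how far Committed_AS already sits above the limit. The sketch below parses the four meminfo lines quoted above and computes the headroom for the current limit and for a hypothetical overcommit_ratio of 100. It ignores swap and huge pages, which are both zero in this report. The helper names are illustrative, and the snapshot was taken shortly after the OOM kill, so the numbers are only indicative.

```python
MEMINFO = """\
MemTotal:       30871632 kB
MemFree:         7011044 kB
CommitLimit:    15435816 kB
Committed_AS:   49437784 kB
"""


def parse_meminfo(text):
    """Return a {field: value} dict from /proc/meminfo-style text (values in kB)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])
    return fields


def commit_headroom_kb(fields, overcommit_ratio=None):
    """Limit minus Committed_AS; negative means the limit is already exceeded.

    With overcommit_ratio=None the reported CommitLimit is used. Otherwise the
    limit is recomputed from MemTotal, ignoring swap and huge pages.
    """
    if overcommit_ratio is None:
        limit = fields["CommitLimit"]
    else:
        limit = fields["MemTotal"] * overcommit_ratio // 100
    return limit - fields["Committed_AS"]


info = parse_meminfo(MEMINFO)
print(commit_headroom_kb(info))       # against the current CommitLimit
print(commit_headroom_kb(info, 100))  # against a limit equal to MemTotal
```

Both results are negative: Committed_AS is above the current CommitLimit and also above MemTotal. Since proc(5) says CommitLimit is only adhered to in strict mode (overcommit_memory=2), this suggests the host was not in strict mode, and that enabling it, even with a ratio of 100, could make allocations fail outright rather than remove the problem.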

        mjacobs Matthew Jacobs added a comment -

        I'm not sure if we need to make such large allocations in free-pool-test, so maybe we can get away with reducing its max allocation size. That said, I also need to figure out if it's expected that Kudu is now using more memory than it was before, as that seems to be what caused us to start failing. I assume we were close to the limit before.

        tarmstrong Tim Armstrong added a comment -

        If Kudu's memory requirements increased, this is going to have other impacts. start-impala-cluster.py assumes that 80% of system memory can be used for impalads.

        mjacobs Matthew Jacobs added a comment -

        I'll test a patch to add a Kudu mem_limit. We don't currently have one, so Kudu also assumes it can use 80% of the system memory.

        mjacobs Matthew Jacobs added a comment -

        Will address the Kudu mem_limit separately. It might require a bit more thought because 1g (what I just tested as a potential tserver mem limit) seems to be too low for some tests.

        commit e7cb80b66f6ff653fa41713c090a9284c5bbc1f7
        Author: Matthew Jacobs <mj@cloudera.com>
        Date: Wed May 10 10:13:38 2017 -0700

        IMPALA-5297: Reduce free-pool-test mem requirement to avoid OOM

        On jenkins.impala.io gerrit-verify-dryrun jobs,
        free-pool-test started running out of memory after updating
        Kudu to a newer version which can have slightly higher
        memory requirements (Kudu will reject writes when mem is 80%
        full rather than 60% full). free-pool-test can allocate up
        to 12gb, and the VMs have a CommitLimit of only 16gb.

        While larger VMs could be used, or the minicluster could be
        tuned further, free-pool-test can also be modified to
        reduce the actual RSS usage. Instead of memsetting the
        entire allocation in the test, we only scribble the first
        byte. This reduces the max RSS from 14gb to 88mb in some
        local tests.

        Change-Id: I31f03e7a4d5d237a1183277c988f85a992396a43
        Reviewed-on: http://gerrit.cloudera.org:8080/6842
        Reviewed-by: Dan Hecht <dhecht@cloudera.com>
        Tested-by: Impala Public Jenkins
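
The reason writing only the first byte helps is that the kernel generally backs anonymous memory with physical pages lazily, on first touch, so an allocation that is never written stays mostly non-resident. The sketch below is a rough Python analogue of the two access patterns, not the actual C++ test code. The function names are hypothetical, and it only counts the pages each pattern writes to rather than measuring RSS.

```python
import mmap


def touch_first_byte(buf):
    """Write a single byte, so at most one page has to become resident."""
    buf[0] = 0xFF
    return 1


def touch_every_page(buf, page_size=mmap.PAGESIZE):
    """Write one byte per page, which forces every page to become resident.

    This has the same residency effect as memsetting the whole buffer.
    """
    touched = 0
    for offset in range(0, len(buf), page_size):
        buf[offset] = 0xFF
        touched += 1
    return touched


SIZE = 64 * 1024 * 1024  # 64 MiB anonymous mapping, a stand-in for one large allocation

with mmap.mmap(-1, SIZE) as buf:
    print("pages written, first byte only:", touch_first_byte(buf))

with mmap.mmap(-1, SIZE) as buf:
    print("pages written, whole buffer:   ", touch_every_page(buf))
```

With 4 kB pages, the second pattern writes to 16384 pages for a 64 MiB buffer while the first writes to one, which is the same kind of gap as the RSS drop described in the commit message.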

        mjacobs Matthew Jacobs added a comment -

        filed IMPALA-5301

          People

          • Assignee:
            mjacobs Matthew Jacobs
            Reporter:
            mjacobs Matthew Jacobs