Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.3.0
-
None
Description
In some situations the Hdfs scan nodes, e.g. the parquet scanner, use excessive amounts of memory. The below query consumes over 2GB of memory in the scanner. The problem appears to be that very large batches (8MB) are being produced by the scanner and the current methods for limiting memory usage (queue length I believe) do not work well with large batches.
[tarmstrong-box.ca.cloudera.com:21000] > use tpch25_nested_parquet; Query: use tpch25_nested_parquet [tarmstrong-box.ca.cloudera.com:21000] > select t10.c_custkey AS int_col, COALESCE(t10.c_custkey, t8.int_col_t5, t9.s_suppkey) AS int_col_t11 FROM customer t1 INNER JOIN ( SELECT (COALESCE(t3.item.l_partkey, t3.item.l_partkey, t3.item.l_partkey)) - (MIN(t3.item.l_partkey)) AS int_col, COALESCE(SUM(t3.item.l_partkey), t3.item.l_partkey, t3.item.l_partkey) AS int_col_t5, t3.item.l_partkey AS int_col_t6 FROM t1.c_orders.item.o_lineitems t2 LEFT JOIN t1.c_orders.item.o_lineitems t3 ON (t3.item.l_shipinstruct) = (t2.item.l_comment) LEFT JOIN t1.c_orders.item.o_lineitems t4 ON (((t4.item.l_suppkey) = (t2.pos)) AND ((t4.item.l_shipdate) = (t2.item.l_comment))) AND ((t4.item.l_comment) = (t2.item.l_comment)) WHERE (t3.pos) NOT IN (t3.item.l_partkey, 184.99) GROUP BY t3.item.l_partkey HAVING (MAX(t2.pos)) IN ((t3.item.l_partkey) - (t3.item.l_partkey), t3.item.l_partkey) ) t8 INNER JOIN supplier t9 ON (t9.s_nationkey) = (t8.int_col) LEFT JOIN customer t10 ON (t10.c_custkey) = (t9.s_nationkey); Query: select t10.c_custkey AS int_col, COALESCE(t10.c_custkey, t8.int_col_t5, t9.s_suppkey) AS int_col_t11 FROM customer t1 INNER JOIN ( SELECT (COALESCE(t3.item.l_partkey, t3.item.l_partkey, t3.item.l_partkey)) - (MIN(t3.item.l_partkey)) AS int_col, COALESCE(SUM(t3.item.l_partkey), t3.item.l_partkey, t3.item.l_partkey) AS int_col_t5, t3.item.l_partkey AS int_col_t6 FROM t1.c_orders.item.o_lineitems t2 LEFT JOIN t1.c_orders.item.o_lineitems t3 ON (t3.item.l_shipinstruct) = (t2.item.l_comment) LEFT JOIN t1.c_orders.item.o_lineitems t4 ON (((t4.item.l_suppkey) = (t2.pos)) AND ((t4.item.l_shipdate) = (t2.item.l_comment))) AND ((t4.item.l_comment) = (t2.item.l_comment)) WHERE (t3.pos) NOT IN (t3.item.l_partkey, 184.99) GROUP BY t3.item.l_partkey HAVING (MAX(t2.pos)) IN ((t3.item.l_partkey) - (t3.item.l_partkey), t3.item.l_partkey) ) t8 INNER JOIN supplier t9 ON (t9.s_nationkey) = (t8.int_col) LEFT JOIN customer t10 ON (t10.c_custkey) = (t9.s_nationkey)
Workaround
Setting num_scanner_threads to a lower number will reduce memory usage at some cost to scan performance. Setting batch_size to a lower value will also reduce memory usage at a cost to query performance.