[IMPALA-2473] Excessive memory usage by scan nodes - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.3.0
Fix Version/s: Impala 2.5.0, Impala 2.3.2, Impala 2.2.10
Component/s: None
Labels:
- performance
- resource-management

Target Version:

Impala 2.5.0, Impala 2.3.2

Description

In some situations the Hdfs scan nodes, e.g. the parquet scanner, use excessive amounts of memory. The below query consumes over 2GB of memory in the scanner. The problem appears to be that very large batches (8MB) are being produced by the scanner and the current methods for limiting memory usage (queue length I believe) do not work well with large batches.

[tarmstrong-box.ca.cloudera.com:21000] > use tpch25_nested_parquet;
Query: use tpch25_nested_parquet
[tarmstrong-box.ca.cloudera.com:21000] > select
t10.c_custkey AS int_col,
COALESCE(t10.c_custkey, t8.int_col_t5, t9.s_suppkey) AS int_col_t11
FROM customer t1
INNER JOIN (
SELECT
(COALESCE(t3.item.l_partkey, t3.item.l_partkey, t3.item.l_partkey)) - (MIN(t3.item.l_partkey)) AS int_col,
COALESCE(SUM(t3.item.l_partkey), t3.item.l_partkey, t3.item.l_partkey) AS int_col_t5,
t3.item.l_partkey AS int_col_t6
FROM t1.c_orders.item.o_lineitems t2                                                                                                     LEFT JOIN t1.c_orders.item.o_lineitems t3 ON (t3.item.l_shipinstruct) = (t2.item.l_comment)
LEFT JOIN t1.c_orders.item.o_lineitems t4 ON (((t4.item.l_suppkey) = (t2.pos)) AND ((t4.item.l_shipdate) = (t2.item.l_comment))) AND ((t4.item.l_comment) = (t2.item.l_comment))
WHERE
(t3.pos) NOT IN (t3.item.l_partkey, 184.99)
GROUP BY
t3.item.l_partkey
HAVING
(MAX(t2.pos)) IN ((t3.item.l_partkey) - (t3.item.l_partkey), t3.item.l_partkey)
) t8
INNER JOIN supplier t9 ON (t9.s_nationkey) = (t8.int_col)
LEFT JOIN customer t10 ON (t10.c_custkey) = (t9.s_nationkey);
Query: select t10.c_custkey AS int_col,
COALESCE(t10.c_custkey, t8.int_col_t5, t9.s_suppkey) AS int_col_t11
FROM customer t1
INNER JOIN (
SELECT
(COALESCE(t3.item.l_partkey, t3.item.l_partkey, t3.item.l_partkey)) - (MIN(t3.item.l_partkey)) AS int_col,
COALESCE(SUM(t3.item.l_partkey), t3.item.l_partkey, t3.item.l_partkey) AS int_col_t5,
t3.item.l_partkey AS int_col_t6
FROM t1.c_orders.item.o_lineitems t2
LEFT JOIN t1.c_orders.item.o_lineitems t3 ON (t3.item.l_shipinstruct) = (t2.item.l_comment)
LEFT JOIN t1.c_orders.item.o_lineitems t4 ON (((t4.item.l_suppkey) = (t2.pos)) AND ((t4.item.l_shipdate) = (t2.item.l_comment))) AND ((t4.item.l_comment) = (t2.item.l_comment))
WHERE
(t3.pos) NOT IN (t3.item.l_partkey, 184.99)
GROUP BY
t3.item.l_partkey
HAVING
(MAX(t2.pos)) IN ((t3.item.l_partkey) - (t3.item.l_partkey), t3.item.l_partkey)
) t8
INNER JOIN supplier t9 ON (t9.s_nationkey) = (t8.int_col)
LEFT JOIN customer t10 ON (t10.c_custkey) = (t9.s_nationkey)

Workaround
Setting num_scanner_threads to a lower number will reduce memory usage at some cost to scan performance. Setting batch_size to a lower value will also reduce memory usage at a cost to query performance.

Attachments

Activity

People

Assignee:: Tim Armstrong

Reporter:: Tim Armstrong

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 02/Oct/15 23:03

Updated:: 01/May/18 21:20

Resolved:: 19/Nov/15 22:59