Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.3.0
Description
For queries that reference nested collections, we assign predicates that reference fields of the nested collection directly in the corresponding parent scan. Those predicates affect how many items are materialized inside a collection-typed slot. There's a good chance that such predicates result in empty collections, but we currently do not filter out the containing tuple if there are such empty collections. Consider this query and its plan:
Query:
select c_custkey, o_orderkey
from tpch_nested_parquet.customer c, c.c_orders
where o_orderkey = 1884930
Plan:
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=176.00MB VCores=1 |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| tpch_nested_parquet.customer |
| |
| 05:EXCHANGE [UNPARTITIONED] |
| | |
| 01:SUBPLAN |
| | |
| |--04:NESTED LOOP JOIN [CROSS JOIN] |
| | | |
| | |--02:SINGULAR ROW SRC |
| | | |
| | 03:UNNEST [c.c_orders] |
| | |
| 00:SCAN HDFS [tpch_nested_parquet.customer c] |
| partitions=1/1 files=4 size=554.13MB |
| predicates on c_orders: o_orderkey = 1884930 |
+------------------------------------------------------------------------------------+
In the query above, the scan returns one row for every customer. It is clear though that only few customer rows will result in any output from the subplan. Avoiding subplan iterations is important because they are expensive even when unnesting empty collections (due to Reset() of the plan tree).
For nested TPCH Q12 on my dev setup this improvement reduced the number of rows returned from the scan by >100x, from 1.5M to 13K.
It seems that the following nested TPCH queries could also benefit from this improvement:
Q3, Q4, Q5, Q7, Q8, Q10, Q12, Q13, Q21
When considering this optimization, special consideration must be given to outer or semi joined nested collections because then applying this optimization may not be correct.