Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.5.0
-
None
Description
set NUM_NODES=0; select * from tpch.lineitem UNION ALL (select * from tpch.lineitem) LIMIT 1
Summary
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ----------------------------------------------------------------------------------------------------------- 00:UNION 1 85.646us 85.646us 1 12.00M 200.00 KB -1.00 B |--02:SCAN HDFS 1 177.568us 177.568us 0 6.00M 0 -1.00 B tpch.lineitem 01:SCAN HDFS 1 141.077ms 141.077ms 1.02K 6.00M 202.96 MB -1.00 B tpch.lineitem
Note 00:UNION estimates 12M rows of output even though the limit is pushed to its children.
Interestingly, if you look at the distributed plan:
Distributed Plan
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ----------------------------------------------------------------------------------------------------------- 03:EXCHANGE 1 14.515us 14.515us 1 1 0 -1.00 B UNPARTITIONED 00:UNION 3 92.674us 144.611us 3 12.00M 200.00 KB 0 |--02:SCAN HDFS 3 180.513us 221.991us 0 6.00M 0 264.00 MB tpch.lineitem 01:SCAN HDFS 3 155.859ms 163.053ms 3.07K 6.00M 65.18 MB 264.00 MB tpch.lineitem
The xchg node correctly reflects the limit, but the UNION still doesn't have the right estimate.
This affects runtime filters which use the single-node cardinalities to estimate the NDV for a particular filter.