Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Impala 1.3.1, Impala 2.0, Impala 2.3.0
Description
I think we assume n:1 joins, so the estimated cardinality is not reduced after the join operator: every probe-side row is assumed to appear in the join output.
Here's an example from TPC-DS q19, but it applies to almost all of the TPC-DS queries.
Operator          Detail            #Hosts  Avg Time   Max Time   #Rows      Est. #Rows  Peak Mem   Est. Peak Mem
-----------------------------------------------------------------------------------------------------------------
18:TOP-N                            1       3.708ms    3.708ms    100        100         117.44 KB  -1.00 B
17:EXCHANGE                         1       236.149us  236.149us  1000       100         9.77 KB    -1.00 B
10:TOP-N                            10      555.602us  591.50us   1000       100         36.00 KB   4.69 KB
16:AGGREGATE                        10      4.950ms    5.85ms     3227       264233526   184.84 KB  6.75 GB
15:EXCHANGE                         10      14s941ms   14s952ms   32270      264233526   30.09 KB   0
09:AGGREGATE                        10      201.615ms  280.335ms  32270      264233526   13.68 MB   12.99 GB
08:HASH JOIN      BROADCAST         10      112.415ms  168.229ms  4646655    264233526   12.32 MB   42.00 KB
|--14:EXCHANGE                      10      286.415ms  296.645ms  1350       1350        44.96 KB   0
|  04:SCAN HDFS   store             1       39.760ms   39.760ms   1350       1350        243.32 KB  32.00 MB
07:HASH JOIN      BROADCAST         10      3s519ms    3s900ms    4708900    264233526   993.89 MB  453.97 MB
|--13:EXCHANGE                      10      2s731ms    2s912ms    15000000   15000000    21.86 MB   0
|  03:SCAN HDFS   customer_address  8       328.503ms  862.653ms  15000000   15000000    26.30 MB   80.00 MB
06:HASH JOIN      BROADCAST         10      16s891ms   17s963ms   4708900    264233526   1.44 GB    377.66 MB
|--12:EXCHANGE                      10      3s785ms    4s422ms    30000000   30000000    17.84 MB   0
|  02:SCAN HDFS   customer          9       292.638ms  1s030ms    30000000   30000000    40.81 MB   176.00 MB
05:HASH JOIN      BROADCAST         10      2s903ms    3s397ms    4823041    264233526   1.10 MB    292.41 KB      <-- We assumed all probe rows are returned and now the estimate is off by a factor of 54.
|--11:EXCHANGE                      10      1s629ms    1s642ms    6478       3429        99.51 KB   0
|  01:SCAN HDFS   item              1       284.503ms  284.503ms  6478       3429        7.76 MB    288.00 MB      <-- The estimate of the first dim table is close
00:SCAN HDFS      store_sales       10      134.410ms  227.81ms   264233526  264233526   446.98 MB  352.00 MB      <-- We have perfect stats on the left most table.
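The "factor of 54" can be checked directly from the #Rows and Est. #Rows columns. The snippet below is only a rough sketch of that arithmetic in Python; the `joins` dictionary and the `overestimate_factor` helper are illustrative names introduced here, not part of Impala, and the numbers are copied by hand from the exec summary.

```python
# (actual #Rows, Est. #Rows) for each hash join, copied by hand from the
# exec summary. The names below are hypothetical and exist only for this check.
joins = {
    "05:HASH JOIN": (4823041, 264233526),
    "06:HASH JOIN": (4708900, 264233526),
    "07:HASH JOIN": (4708900, 264233526),
    "08:HASH JOIN": (4646655, 264233526),
}


def overestimate_factor(actual_rows, estimated_rows):
    """Return how many times larger the estimate is than the actual row count."""
    return estimated_rows / actual_rows


factors = {op: overestimate_factor(actual, est) for op, (actual, est) in joins.items()}

for op, factor in factors.items():
    print(f"{op}: estimate is {factor:.1f}x the actual row count")
```

Because the estimate stays at the `store_sales` scan cardinality through every join, the error introduced at join 05 carries through joins 06 to 08 and into the aggregations above them.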