Details
-
Bug
-
Status: Open
-
Critical
-
Resolution: Unresolved
-
2.2.0
-
None
-
None
Description
hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from store_sales where ss_sold_date_sk NOT in (select d_date_sk from date_dim); Stage-1 Reducer 2 vectorized, llap File Output Operator [FS_52] Group By Operator [GBY_51] (rows=1 width=8) Output:["_col0"],aggregations:["count(VALUE._col0)"] <-Map 1 [SIMPLE_EDGE] vectorized, llap SHUFFLE [RS_50] Group By Operator [GBY_49] (rows=1 width=8) Output:["_col0"],aggregations:["count(1)"] Select Operator [SEL_48] (rows=1 width=4) Filter Operator [FIL_47] (rows=1 width=4) predicate:_col2 is null Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4) Conds:MAPJOIN_45._col0=RS_43._col0(Left Outer),Output:["_col2"] <-Map 5 [BROADCAST_EDGE] vectorized, llap BROADCAST [RS_43] PartitionCols:_col0 Select Operator [SEL_42] (rows=73049 width=4) Output:["_col0"] TableScan [TS_11] (rows=73049 width=4) tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"] <-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4) Conds:(Inner),Output:["_col0"] <-Reducer 4 [BROADCAST_EDGE] vectorized, llap BROADCAST [RS_41] Select Operator [SEL_40] (rows=1 width=8) Filter Operator [FIL_39] (rows=1 width=8) predicate:(_col0 = 0) Group By Operator [GBY_38] (rows=1 width=8) Output:["_col0"],aggregations:["count(VALUE._col0)"] <-Map 3 [SIMPLE_EDGE] vectorized, llap SHUFFLE [RS_37] Group By Operator [GBY_36] (rows=1 width=8) Output:["_col0"],aggregations:["count()"] Select Operator [SEL_35] (rows=1 width=4) Filter Operator [FIL_34] (rows=1 width=4) predicate:d_date_sk is null TableScan [TS_2] (rows=73049 width=4) tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"] <-Select Operator [SEL_44] (rows=2879987999 width=4) Output:["_col0"] TableScan [TS_0] (rows=2879987999 width=92) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE
The 2nd scan is merely to count the number of NULLs and has predicate:d_date_sk is null in the operator.
The NULL checks can be done inline with the NOT-NULL codepath instead of producing 2 independent full-scans of the date_dim table.
This is not significant in a scenario like the above where the small table side is an actual HDFS table, but entirely throttles performance when the small side is actually an expensive aggregate.