[HIVE-14708] Optimizer: NOT IN query scans one input two times - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: Logical Optimizer
Labels:
None

Description

hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from store_sales where ss_sold_date_sk NOT in (select d_date_sk from date_dim);

    Stage-1
      Reducer 2 vectorized, llap
      File Output Operator [FS_52]
        Group By Operator [GBY_51] (rows=1 width=8)
          Output:["_col0"],aggregations:["count(VALUE._col0)"]
        <-Map 1 [SIMPLE_EDGE] vectorized, llap
          SHUFFLE [RS_50]
            Group By Operator [GBY_49] (rows=1 width=8)
              Output:["_col0"],aggregations:["count(1)"]
              Select Operator [SEL_48] (rows=1 width=4)
                Filter Operator [FIL_47] (rows=1 width=4)
                  predicate:_col2 is null
                  Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4)
                    Conds:MAPJOIN_45._col0=RS_43._col0(Left Outer),Output:["_col2"]
                  <-Map 5 [BROADCAST_EDGE] vectorized, llap
                    BROADCAST [RS_43]
                      PartitionCols:_col0
                      Select Operator [SEL_42] (rows=73049 width=4)
                        Output:["_col0"]
                        TableScan [TS_11] (rows=73049 width=4)
                          tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
                  <-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4)
                      Conds:(Inner),Output:["_col0"]
                    <-Reducer 4 [BROADCAST_EDGE] vectorized, llap
                      BROADCAST [RS_41]
                        Select Operator [SEL_40] (rows=1 width=8)
                          Filter Operator [FIL_39] (rows=1 width=8)
                            predicate:(_col0 = 0)
                            Group By Operator [GBY_38] (rows=1 width=8)
                              Output:["_col0"],aggregations:["count(VALUE._col0)"]
                            <-Map 3 [SIMPLE_EDGE] vectorized, llap
                              SHUFFLE [RS_37]
                                Group By Operator [GBY_36] (rows=1 width=8)
                                  Output:["_col0"],aggregations:["count()"]
                                  Select Operator [SEL_35] (rows=1 width=4)
                                    Filter Operator [FIL_34] (rows=1 width=4)
                                      predicate:d_date_sk is null
                                      TableScan [TS_2] (rows=73049 width=4)
                                        tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
                    <-Select Operator [SEL_44] (rows=2879987999 width=4)
                        Output:["_col0"]
                        TableScan [TS_0] (rows=2879987999 width=92)
                          tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE

The 2nd scan is merely to count the number of NULLs and has predicate:d_date_sk is null in the operator.

The NULL checks can be done inline with the NOT-NULL codepath instead of producing 2 independent full-scans of the date_dim table.

This is not significant in a scenario like the above where the small table side is an actual HDFS table, but entirely throttles performance when the small side is actually an expensive aggregate.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Gopal Vijayaraghavan

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 06/Sep/16 17:36

Updated:: 07/Sep/16 00:24