Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-14708

Optimizer: NOT IN query scans one input two times

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 2.2.0
    • None
    • Logical Optimizer
    • None

    Description

      hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from store_sales where ss_sold_date_sk NOT in (select d_date_sk from date_dim);
      
          Stage-1
            Reducer 2 vectorized, llap
            File Output Operator [FS_52]
              Group By Operator [GBY_51] (rows=1 width=8)
                Output:["_col0"],aggregations:["count(VALUE._col0)"]
              <-Map 1 [SIMPLE_EDGE] vectorized, llap
                SHUFFLE [RS_50]
                  Group By Operator [GBY_49] (rows=1 width=8)
                    Output:["_col0"],aggregations:["count(1)"]
                    Select Operator [SEL_48] (rows=1 width=4)
                      Filter Operator [FIL_47] (rows=1 width=4)
                        predicate:_col2 is null
                        Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4)
                          Conds:MAPJOIN_45._col0=RS_43._col0(Left Outer),Output:["_col2"]
                        <-Map 5 [BROADCAST_EDGE] vectorized, llap
                          BROADCAST [RS_43]
                            PartitionCols:_col0
                            Select Operator [SEL_42] (rows=73049 width=4)
                              Output:["_col0"]
                              TableScan [TS_11] (rows=73049 width=4)
                                tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
                        <-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4)
                            Conds:(Inner),Output:["_col0"]
                          <-Reducer 4 [BROADCAST_EDGE] vectorized, llap
                            BROADCAST [RS_41]
                              Select Operator [SEL_40] (rows=1 width=8)
                                Filter Operator [FIL_39] (rows=1 width=8)
                                  predicate:(_col0 = 0)
                                  Group By Operator [GBY_38] (rows=1 width=8)
                                    Output:["_col0"],aggregations:["count(VALUE._col0)"]
                                  <-Map 3 [SIMPLE_EDGE] vectorized, llap
                                    SHUFFLE [RS_37]
                                      Group By Operator [GBY_36] (rows=1 width=8)
                                        Output:["_col0"],aggregations:["count()"]
                                        Select Operator [SEL_35] (rows=1 width=4)
                                          Filter Operator [FIL_34] (rows=1 width=4)
                                            predicate:d_date_sk is null
                                            TableScan [TS_2] (rows=73049 width=4)
                                              tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
                          <-Select Operator [SEL_44] (rows=2879987999 width=4)
                              Output:["_col0"]
                              TableScan [TS_0] (rows=2879987999 width=92)
                                tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE
      
      

      The 2nd scan is merely to count the number of NULLs and has predicate:d_date_sk is null in the operator.

      The NULL checks can be done inline with the NOT-NULL codepath instead of producing 2 independent full-scans of the date_dim table.

      This is not significant in a scenario like the above where the small table side is an actual HDFS table, but entirely throttles performance when the small side is actually an expensive aggregate.

      Attachments

        Activity

          People

            Unassigned Unassigned
            gopalv Gopal Vijayaraghavan
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: