Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-2337

Predicate pushdown erroneously conservative with outer joins

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.8.0
    • Query Processor
    • None
    • Reviewed

    Description

      The predicate pushdown filter is not applying left associativity of joins correctly in determining possible aliases for pushing predicates.

      In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria for pushing aliases is specified as:

          /**
           * Figures out the aliases for whom it is safe to push predicates based on
           * ANSI SQL semantics For inner join, all predicates for all aliases can be
           * pushed For full outer join, none of the predicates can be pushed as that
           * would limit the number of rows for join For left outer join, all the
           * predicates on the left side aliases can be pushed up For right outer
           * join, all the predicates on the right side aliases can be pushed up Joins
           * chain containing both left and right outer joins are treated as full
           * outer join. [...]
           *
           * @param op
           *          Join Operator
           * @param rr
           *          Row resolver
           * @return set of qualified aliases
           */
      

      Since hive joins are left associative, something like "a RIGHT OUTER JOIN b LEFT OUTER JOIN cĀ INNER JOIN d" should be interpreted as "((a RIGHT OUTER JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases where joins with both left and right outer joins can have aliases that can be pushed. Here, aliases b and d are eligible to be pushed up while the current criteria provide that none are eligible.

      Using:

      create table t1 (id int, key string, value string);
      create table t2 (id int, key string, value string);
      create table t3 (id int, key string, value string);
      create table t4 (id int, key string, value string);
      

      For example, the query

      explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on t2.id=t3.id where t3.id=20; 
      

      currently gives

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Alias -> Map Operator Tree:
              t1 
                TableScan
                  alias: t1
                  Reduce Output Operator
                    key expressions:
                          expr: id
                          type: int
                    sort order: +
                    Map-reduce partition columns:
                          expr: id
                          type: int
                    tag: 0
                    value expressions:
                          expr: id
                          type: int
                          expr: key
                          type: string
                          expr: value
                          type: string
              t2 
                TableScan
                  alias: t2
                  Reduce Output Operator
                    key expressions:
                          expr: id
                          type: int
                    sort order: +
                    Map-reduce partition columns:
                          expr: id
                          type: int
                    tag: 1
                    value expressions:
                          expr: id
                          type: int
                          expr: key
                          type: string
                          expr: value
                          type: string
              t3 
                TableScan
                  alias: t3
                  Reduce Output Operator
                    key expressions:
                          expr: id
                          type: int
                    sort order: +
                    Map-reduce partition columns:
                          expr: id
                          type: int
                    tag: 2
                    value expressions:
                          expr: id
                          type: int
                          expr: key
                          type: string
                          expr: value
                          type: string
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Outer Join 0 to 1
                     Inner Join 1 to 2
                condition expressions:
                  0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
                  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
                  2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
                handleSkewJoin: false
                outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _col10, _col11, _col12
                Filter Operator
                  predicate:
                      expr: (_col10 = 20)
                      type: boolean
                  Select Operator
                    expressions:
                          expr: _col0
                          type: int
                          expr: _col1
                          type: string
                          expr: _col2
                          type: string
                          expr: _col5
                          type: int
                          expr: _col6
                          type: string
                          expr: _col7
                          type: string
                          expr: _col10
                          type: int
                          expr: _col11
                          type: string
                          expr: _col12
                          type: string
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
      

      while the correct behavior is to push the filter "t3.id=20" down:

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Alias -> Map Operator Tree:
              t1 
                TableScan
                  alias: t1
                  Reduce Output Operator
                    key expressions:
                          expr: id
                          type: int
                    sort order: +
                    Map-reduce partition columns:
                          expr: id
                          type: int
                    tag: 0
                    value expressions:
                          expr: id
                          type: int
                          expr: key
                          type: string
                          expr: value
                          type: string
              t2 
                TableScan
                  alias: t2
                  Reduce Output Operator
                    key expressions:
                          expr: id
                          type: int
                    sort order: +
                    Map-reduce partition columns:
                          expr: id
                          type: int
                    tag: 1
                    value expressions:
                          expr: id
                          type: int
                          expr: key
                          type: string
                          expr: value
                          type: string
              t3 
                TableScan
                  alias: t3
                  Filter Operator
                    predicate:
                        expr: (id = 20)
                        type: boolean
                    Reduce Output Operator
                      key expressions:
                            expr: id
                            type: int
                      sort order: +
                      Map-reduce partition columns:
                            expr: id
                            type: int
                      tag: 2
                      value expressions:
                            expr: id
                            type: int
                            expr: key
                            type: string
                            expr: value
                            type: string
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Outer Join 0 to 1
                     Inner Join 1 to 2
                condition expressions:
                  0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
                  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
                  2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
                handleSkewJoin: false
                outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _col10, _col11, _col12
                Select Operator
                  expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: string
                        expr: _col2
                        type: string
                        expr: _col5
                        type: int
                        expr: _col6
                        type: string
                        expr: _col7
                        type: string
                        expr: _col10
                        type: int
                        expr: _col11
                        type: string
                        expr: _col12
                        type: string
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
                  File Output Operator
                    compressed: false
                    GlobalTableId: 0
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
      

      The current behavior is actually stranger than this: for a left outer join (similarly for a right outer join), hive finds the leftmost alias referred to in the predicates of left outer joins and rejects any alias to the right of it for pushdown. So in this query the filter "t2.id=20" pushed down:

      explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on (t2.id=t3.id) where t2.id=20;
      

      while it isn't here:

      explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on (t1.id=t3.id) where t2.id=20;
      

      Attachments

        1. HIVE-2337v7.patch
          54 kB
          Charles Chen
        2. HIVE-2337v6.patch
          25 kB
          Charles Chen
        3. HIVE-2337v5.patch
          9 kB
          Charles Chen
        4. HIVE-2337v4.patch
          25 kB
          Charles Chen
        5. HIVE-2337v3.patch
          21 kB
          Charles Chen
        6. HIVE-2337v2.patch
          20 kB
          Charles Chen
        7. HIVE-2337v1.patch
          4 kB
          Charles Chen

        Issue Links

          Activity

            People

              ccy Charles Chen
              ccy Charles Chen
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: