Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-26043

Use constraint info when creating RexNodes

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • None
    • CBO

    Description

      Prior HIVE-23100 Not null constraints affected newly created RexNode type nullability.
      Nullability enables the subquery rewrite algorithm to generate more optimal plan.
      https://github.com/apache/hive/blob/1213ad3f0ae0e21e7519dc28b8b6d1401cdd1441/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSubQueryRemoveRule.java#L324

      Example:

      explain cbo
      select ws_sales_price
       from web_sales, customer, item
       where ws_bill_customer_sk = c_customer_sk
       	and ws_item_sk = i_item_sk
       	and ( c_customer_sk = 1
       	      or
       	      i_item_id in (select i_item_id
                                   from item
                                   where i_item_sk in (2, 3)
                                   )
       	    );
      

      Without not null constraints

      HiveProject(ws_sales_price=[$2])
        HiveFilter(condition=[OR(AND(<>($6, 0), IS NOT NULL($8)), =($3, 1))])
          HiveProject(ws_item_sk=[$0], ws_bill_customer_sk=[$1], ws_sales_price=[$2], c_customer_sk=[$8], i_item_sk=[$3], i_item_id=[$4], c=[$5], i_item_id0=[$6], literalTrue=[$7])
            HiveJoin(condition=[=($1, $8)], joinType=[inner], algorithm=[none], cost=[not available])
              HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[not available])
                HiveProject(ws_item_sk=[$2], ws_bill_customer_sk=[$3], ws_sales_price=[$20])
                  HiveFilter(condition=[IS NOT NULL($3)])
                    HiveTableScan(table=[[default, web_sales]], table:alias=[web_sales])
                HiveJoin(condition=[=($1, $3)], joinType=[left], algorithm=[none], cost=[not available])
                  HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[not available])
                    HiveProject(i_item_sk=[$0], i_item_id=[$1])
                      HiveTableScan(table=[[default, item]], table:alias=[item])
                    HiveProject(c=[$0])
                      HiveAggregate(group=[{}], c=[COUNT()])
                        HiveFilter(condition=[IN($0, 2:BIGINT, 3:BIGINT)])
                          HiveTableScan(table=[[default, item]], table:alias=[item])
                  HiveProject(i_item_id=[$0], literalTrue=[true])
                    HiveAggregate(group=[{1}])
                      HiveFilter(condition=[IN($0, 2:BIGINT, 3:BIGINT)])
                        HiveTableScan(table=[[default, item]], table:alias=[item])
              HiveProject(c_customer_sk=[$0])
                HiveTableScan(table=[[default, customer]], table:alias=[customer])
      

      With not null constraints

      HiveProject(ws_sales_price=[$2])
        HiveFilter(condition=[OR(IS NOT NULL($7), =($3, 1))])
          HiveProject(ws_item_sk=[$0], ws_bill_customer_sk=[$1], ws_sales_price=[$2], c_customer_sk=[$7], i_item_sk=[$3], i_item_id=[$4], i_item_id0=[$5], literalTrue=[$6])
            HiveJoin(condition=[=($1, $7)], joinType=[inner], algorithm=[none], cost=[not available])
              HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[not available])
                HiveProject(ws_item_sk=[$2], ws_bill_customer_sk=[$3], ws_sales_price=[$20])
                  HiveFilter(condition=[IS NOT NULL($3)])
                    HiveTableScan(table=[[default, web_sales]], table:alias=[web_sales])
                HiveJoin(condition=[=($1, $2)], joinType=[left], algorithm=[none], cost=[not available])
                  HiveProject(i_item_sk=[$0], i_item_id=[$1])
                    HiveTableScan(table=[[default, item]], table:alias=[item])
                  HiveProject(i_item_id=[$0], literalTrue=[true])
                    HiveAggregate(group=[{1}])
                      HiveFilter(condition=[IN($0, 2:BIGINT, 3:BIGINT)])
                        HiveTableScan(table=[[default, item]], table:alias=[item])
              HiveProject(c_customer_sk=[$0])
                HiveTableScan(table=[[default, customer]], table:alias=[customer])
      

      In the first plan when not null constraints was ignored there is an extra item table join without join condition:

                  HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[not available])
                    HiveProject(i_item_sk=[$0], i_item_id=[$1])
                      HiveTableScan(table=[[default, item]], table:alias=[item])
                    HiveProject(c=[$0])
                      HiveAggregate(group=[{}], c=[COUNT()])
                        HiveFilter(condition=[IN($0, 2:BIGINT, 3:BIGINT)])
                          HiveTableScan(table=[[default, item]], table:alias=[item])
      
      

      The planner is not aware that the i_item_id column has not null defined and it expects null values which needs the extra join.

      Attachments

        Issue Links

          Activity

            People

              kkasa Krisztian Kasa
              kkasa Krisztian Kasa
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 40m
                  1h 40m