Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-5369 Annotate hive operator tree with statistics from metastore
  3. HIVE-7991

Incorrect calculation of number of rows in JoinStatsRule.process results in overflow

Log workAgile BoardRank to TopRank to BottomVotersWatch issueWatchersConvert to IssueMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 0.13.1
    • 0.14.0
    • Statistics
    • None

    Description

      This loop results in adding the parent twice incase of a 3 way join of store_sales x date_dim x store

               for (int pos = 0; pos < parents.size(); pos++) {
                  ReduceSinkOperator parent = (ReduceSinkOperator) jop.getParentOperators().get(pos);
      
                  Statistics parentStats = parent.getStatistics();
                  List<ExprNodeDesc> keyExprs = parent.getConf().getKeyCols();
      
                  // Parent RS may have column statistics from multiple parents.
                  // Populate table alias to row count map, this will be used later to
                  // scale down/up column statistics based on new row count
                  // NOTE: JOIN with UNION as parent of RS will not have table alias
                  // propagated properly. UNION operator does not propagate the table
                  // alias of subqueries properly to expression nodes. Hence union20.q
                  // will have wrong number of rows.
                  Set<String> tableAliases = StatsUtils.getAllTableAlias(parent.getColumnExprMap());
                  for (String tabAlias : tableAliases) {
                    rowCountParents.put(tabAlias, parentStats.getNumRows());
                  }
      

      In the first join we have rowCountParents with

      {store_sales=120464862, date_dim=36524}

      which is correct.
      For the second join result rowCountParents ends up with

      {store=212, store_sales=120464862, date_dim=120464862}

      where it should be

      {store=212, store_sales=120464862, date_dim=36524}

      .
      The result of this is that computeNewRowCount ends up multiplying row count of store_sales x store_sales which makes the number of rows really high and eventually over flow.

      Plan snippet :

         Map 1
                  Map Operator Tree:
                      TableScan
                        alias: store_sales
                        filterExpr: (((ss_sold_date_sk is not null and ss_store_sk is not null) and ss_item_sk is not null) and ss_sold_date BETWEEN '1999-06-01' AND '2000-05-31') (type: boolean)
                        Statistics: Num rows: 110339135 Data size: 4817453454 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((ss_sold_date_sk is not null and ss_store_sk is not null) and ss_item_sk is not null) (type: boolean)
                          Statistics: Num rows: 107740258 Data size: 2124353556 Basic stats: COMPLETE Column stats: COMPLETE
                          Map Join Operator
                            condition map:
                                 Inner Join 0 to 1
                            condition expressions:
                              0 {ss_sold_date_sk} {ss_item_sk} {ss_store_sk} {ss_quantity} {ss_sales_price} {ss_sold_date}
                              1 {d_date_sk} {d_month_seq} {d_year} {d_moy} {d_qoy}
                            keys:
                              0 ss_sold_date_sk (type: int)
                              1 d_date_sk (type: int)
                            outputColumnNames: _col0, _col2, _col7, _col10, _col13, _col23, _col27, _col30, _col33, _col35, _col37
                            input vertices:
                              1 Map 6
                            Statistics: Num rows: 120464862 Data size: 26984129088 Basic stats: COMPLETE Column stats: COMPLETE
                            Map Join Operator
                              condition map:
                                   Inner Join 0 to 1
                              condition expressions:
                                0 {_col0} {_col2} {_col7} {_col10} {_col13} {_col23} {_col27} {_col30} {_col33} {_col35} {_col37}
                                1 {s_store_sk} {s_store_id}
                              keys:
                                0 _col7 (type: int)
                                1 s_store_sk (type: int)
                              outputColumnNames: _col0, _col2, _col7, _col10, _col13, _col23, _col27, _col30, _col33, _col35, _col37, _col58, _col59
                              input vertices:
                                1 Map 5
                              Statistics: Num rows: 17886616227069518 Data size: 5866810122478801920 Basic stats: COMPLETE Column stats: COMPLETE
                              Map Join Operator
                                condition map:
                                     Inner Join 0 to 1
                                condition expressions:
                                  0 {_col0} {_col2} {_col7} {_col10} {_col13} {_col23} {_col27} {_col30} {_col33} {_col35} {_col37} {_col58} {_col59}
                                  1 {i_item_sk} {i_brand} {i_class} {i_category} {i_product_name}
                                keys:
                                  0 _col2 (type: int)
                                  1 i_item_sk (type: int)
                                outputColumnNames: _col0, _col2, _col7, _col10, _col13, _col23, _col27, _col30, _col33, _col35, _col37, _col58, _col59, _col90, _col98, _col100, _col102, _col111
                                input vertices:
                                  1 Map 7
                                Statistics: Num rows: -9223372036854775808 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                                Filter Operator
                                  predicate: (((((_col0 = _col27) and (_col2 = _col90)) and (_col7 = _col58)) and _col30 BETWEEN 1193 AND 1204) and _col23 BETWEEN '1999-06-01' AND '2000-05-31') (type: boolean)
                                  Statistics: Num rows: -9223372036854775808 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                                  Select Operator
                                    expressions: _col102 (type: string), _col100 (type: string), _col98 (type: string), _col111 (type: string), _col33 (type: int), _col37 (type: int), _col35 (type: int), _col59 (type: string), _col13 (type: float), _col10 (type: int)
                                    outputColumnNames: _col102, _col100, _col98, _col111, _col33, _col37, _col35, _col59, _col13, _col10
                                    Statistics: Num rows: -9223372036854775808 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                                    Group By Operator
                                      aggregations: sum(COALESCE((_col13 * _col10),0))
                                      keys: _col102 (type: string), _col100 (type: string), _col98 (type: string), _col111 (type: string), _col33 (type: int), _col37 (type: int), _col35 (type: int), _col59 (type: string), '0' (type: string)
                                      mode: hash
                                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
                                      Statistics: Num rows: -9223372036854775808 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                                      Reduce Output Operator
                                        key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: int), _col5 (type: int), _col6 (type: int), _col7 (type: string), _col8 (type: string)
                                        sort order: +++++++++
                                        Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: int), _col5 (type: int), _col6 (type: int), _col7 (type: string), _col8 (type: string)
                                        Statistics: Num rows: -9223372036854775808 Data size: 0 Basic stats: NONE Column stats: COMPLETE
                                        value expressions: _col9 (type: double)
      

      Attachments

        1. HIVE-7991.1.patch
          2 kB
          Prasanth Jayachandran
        2. HIVE-7991.2.patch
          3 kB
          Prasanth Jayachandran

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            prasanth_j Prasanth Jayachandran Assign to me
            mmokhtar Mostafa Mokhtar
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Issue deployment