Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-13254

GBY cardinality estimation is wrong partition columns is involved

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.3.0, 2.0.0, 2.1.0
    • None
    • Hive
    • None

    Description

      When running the following query on TPCDS-1000 scale, setting hive.transpose.aggr.join=true is expected to generate optimal plan but it was not generating.

      Query
      SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
             `item`.`i_category`     AS `i_category` 
      FROM   `store_sales` `store_sales` 
             INNER JOIN `item` `item` 
                     ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
             INNER JOIN `date_dim` `date_dim` 
                     ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` ) 
      GROUP  BY `d_day_name`, 
                `i_category`;
      

      The reason for that is stats annotation rule for GROUP BY is not considering partition column into account. For the above query, the generated plan is attached. As we can see from the plan, GBY is pushed to fact table (store_sales) but that output of GBY shuffled to perform join instead of MapJoin conversion. This is because of wrong estimation of cardinality/data size of GBY on store_sales (Map 1).

      What's happening internally is, GBY computes estimated cardinality which in this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. This estimate is wrong as ss_sold_date_sk is partition column and estimator assumes its non-partition column. In this case, not every tasks reads data from all partitions. We need to take estimated task parallelism into account. For example: If task parallelism is determined to be 100 the estimate from GBY should be ~6M which should convert this vertex into map join vertex.

      Attachments

        1. q3_ef_transpose_aggr.svg
          98 kB
          Prasanth Jayachandran
        2. q3.svg
          59 kB
          Prasanth Jayachandran

        Activity

          People

            Unassigned Unassigned
            prasanth_j Prasanth Jayachandran
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: