Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-13254

GBY cardinality estimation is wrong partition columns is involved

Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsAdd voteVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.3.0, 2.0.0, 2.1.0
    • None
    • Hive
    • None

    Description

      When running the following query on TPCDS-1000 scale, setting hive.transpose.aggr.join=true is expected to generate optimal plan but it was not generating.

      Query
      SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
             `item`.`i_category`     AS `i_category` 
      FROM   `store_sales` `store_sales` 
             INNER JOIN `item` `item` 
                     ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
             INNER JOIN `date_dim` `date_dim` 
                     ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` ) 
      GROUP  BY `d_day_name`, 
                `i_category`;
      

      The reason for that is stats annotation rule for GROUP BY is not considering partition column into account. For the above query, the generated plan is attached. As we can see from the plan, GBY is pushed to fact table (store_sales) but that output of GBY shuffled to perform join instead of MapJoin conversion. This is because of wrong estimation of cardinality/data size of GBY on store_sales (Map 1).

      What's happening internally is, GBY computes estimated cardinality which in this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. This estimate is wrong as ss_sold_date_sk is partition column and estimator assumes its non-partition column. In this case, not every tasks reads data from all partitions. We need to take estimated task parallelism into account. For example: If task parallelism is determined to be 100 the estimate from GBY should be ~6M which should convert this vertex into map join vertex.

      Attachments

        1. q3.svg
          59 kB
          Prasanth Jayachandran
        2. q3_ef_transpose_aggr.svg
          98 kB
          Prasanth Jayachandran

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned Assign to me
            prasanth_j Prasanth Jayachandran

            Dates

              Created:
              Updated:

              Slack

                Issue deployment