Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-22269

Stats miss with "hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer) leads to wrong reducer count

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 4.0.0-alpha-1
    • Statistics
    • None

    Description

      hive.optimize.sort.dynamic.partition=true introduces new stage to reduce number of writes in dynamic partitioning usecase. Earlier SortedDynPartitionOptimizer added this new operator via Optimizer.java and the stats for the newly added operator was populated via StatsRulesProcFactory$ReduceSinkStatsRule.

      However, with "HIVE-20703" this got changed. This is moved to TezCompiler for cost based decision. Though the operator gets added correctly, the stats for this does not get added (as it runs after runStatsAnnotation()).

      This causes reducer count to be mis-estimated in the query.

      e.g For the following query, reducer_2 would be estimated as "2" instead of "1009". This causes huge delay in the runtime.
      
      explain 
      from tpcds_xtext_1000.store_sales ss
      insert overwrite table store_sales partition (ss_sold_date_sk)
      select
              ss.ss_sold_time_sk,
              ss.ss_item_sk,
              ss.ss_customer_sk,
              ss.ss_cdemo_sk,
              ss.ss_hdemo_sk,
              ss.ss_addr_sk,
              ss.ss_store_sk,
              ss.ss_promo_sk,
              ss.ss_ticket_number,
              ss.ss_quantity,
              ss.ss_wholesale_cost,
              ss.ss_list_price,
              ss.ss_sales_price,
              ss.ss_ext_discount_amt,
              ss.ss_ext_sales_price,
              ss.ss_ext_wholesale_cost,
              ss.ss_ext_list_price,
              ss.ss_ext_tax,
              ss.ss_coupon_amt,
              ss.ss_net_paid,
              ss.ss_net_paid_inc_tax,
              ss.ss_net_profit,
              ss.ss_sold_date_sk
              where ss.ss_sold_date_sk is not null
      insert overwrite table store_sales partition (ss_sold_date_sk)
      select
              ss.ss_sold_time_sk,
              ss.ss_item_sk,
              ss.ss_customer_sk,
              ss.ss_cdemo_sk,
              ss.ss_hdemo_sk,
              ss.ss_addr_sk,
              ss.ss_store_sk,
              ss.ss_promo_sk,
              ss.ss_ticket_number,
              ss.ss_quantity,
              ss.ss_wholesale_cost,
              ss.ss_list_price,
              ss.ss_sales_price,
              ss.ss_ext_discount_amt,
              ss.ss_ext_sales_price,
              ss.ss_ext_wholesale_cost,
              ss.ss_ext_list_price,
              ss.ss_ext_tax,
              ss.ss_coupon_amt,
              ss.ss_net_paid,
              ss.ss_net_paid_inc_tax,
              ss.ss_net_profit,
              ss.ss_sold_date_sk
              where ss.ss_sold_date_sk is null
              distribute by ss.ss_item_sk
      ;
      

      Attachments

        1. HIVE-22269.2.patch
          76 kB
          Gopal Vijayaraghavan
        2. HIVE-22269.1.patch
          0.9 kB
          Rajesh Balamohan

        Issue Links

          Activity

            People

              rajesh.balamohan Rajesh Balamohan
              rajesh.balamohan Rajesh Balamohan
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: