HIVE-16290

Stats: StatsRulesProcFactory::evaluateComparator estimates are wrong when minValue == filterValue

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: Statistics
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Issue:
      =====
      In StatsRulesProcFactory::evaluateComparator, when minValue is >= the filter value, the estimate should be all rows. Currently, it returns numRows/3. This causes fewer reducers to be spun up in queries, e.g. Q79 in TPC-DS.
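
The rule described above can be sketched as follows. This is a minimal illustration, not the code in StatsRulesProcFactory or in the attached patches: the class `RangeAwareEstimate` and its methods are hypothetical names, and the `maxValue <= value` branch is an assumed symmetric case that the description does not mention.

```java
public class RangeAwareEstimate {

    // The numRows/3 heuristic mentioned in the description, used when the
    // column range does not decide the predicate.
    static long fallback(long numRows) {
        return numRows / 3;
    }

    // Hypothetical estimator for the predicate `col > value`, given the
    // column's min/max statistics.
    static long estimateGreaterThan(long numRows, long minValue, long maxValue, long value) {
        if (minValue >= value) {
            // Rule from the description: the column range starts at or above
            // the literal, so the estimate keeps all rows.
            return numRows;
        }
        if (maxValue <= value) {
            // Assumption (not in the description): no value in [min, max]
            // exceeds the literal, so no rows qualify.
            return 0;
        }
        return fallback(numRows);
    }

    public static void main(String[] args) {
        // Column 7 of the store table: min 200, max 300, 1002 rows.
        System.out.println(estimateGreaterThan(1002, 200, 300, 200)); // prints 1002
        System.out.println(estimateGreaterThan(1002, 200, 300, 250)); // prints 334
    }
}
```

The design choice here is to consult the range first and fall back to the fixed fraction only when the literal lies strictly inside the range.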

      E.g: TPC-DS store table stats:
      =============================

      hive --orcfiledump hdfs://nn:8020/apps/hive/warehouse/tpcds_bin_partitioned_orc_1000.db/store/000000_0
      Stripe Statistics:
        Stripe 1:
          Column 0: count: 1002 hasNull: false
          Column 1: count: 1002 hasNull: false min: 1 max: 1002 sum: 502503
          Column 2: count: 1002 hasNull: false min: AAAAAAAAAABAAAAA max: AAAAAAAAPPBAAAAA sum: 16032
          Column 3: count: 1002 hasNull: false min:  max: 2001-03-13 sum: 9950
          Column 4: count: 1002 hasNull: false min:  max: 2001-03-12 sum: 5010
          Column 5: count: 273 hasNull: true min: 2450820 max: 2451313 sum: 669141525
          Column 6: count: 1002 hasNull: false min:  max: pri sum: 3916
          Column 7: count: 994 hasNull: true min: 200 max: 300 sum: 249970
          Column 8: count: 996 hasNull: true min: 5002549 max: 9997773 sum: 7382689071
          Column 9: count: 1002 hasNull: false min:  max: 8AM-8AM sum: 7088
      
      select compute_stats(s_employee_count, 16) from store;
      
      {"columntype":"Long","min":200,"max":300,"countnulls":8,"numdistinctvalues":63,"ndvbitvector":"{0, 1, 2, 3, 4, 5, 11, 12}{0, 1, 2, 3, 6}{0, 1, 2, 3, 4, 5, 7, 11}{0, 1, 2, 3, 4, 5, 7}{0, 1, 2, 3, 4, 5, 6}{0, 1, 2, 3, 4, 5, 8}{0, 1, 2, 3, 4}{0, 1, 2, 3, 4, 5, 7, 9}{0, 1, 2, 3, 4}{0}{0, 1, 2, 3, 4, 5, 7}{0, 1, 2, 3, 4, 5, 6, 7}{0, 1, 2, 3, 4, 8, 9, 14}{0, 1, 2, 3, 5}{0, 1, 2, 3, 4, 5, 6, 7}{0, 1, 2, 3, 4, 5, 6, 8}"}
      
      explain select count(s_store_sk) from store where s_number_employees > 200 and s_number_employees < 295;
      

      The query above first estimates 1002/3 = 334 rows for s_number_employees > 200, and then 334/3 = 111 rows for s_number_employees < 295. Ideally, the estimate for the filter s_number_employees > 200 should be all 1002 rows.
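
The compounding arithmetic can be reproduced with a small sketch. The class `ChainedFilterEstimate` and its method names are hypothetical. The "ideal" path assumes that only the first comparison is corrected and that the second still uses the numRows/3 heuristic, which the description does not specify.

```java
public class ChainedFilterEstimate {

    // The fixed-fraction heuristic applied per comparison (integer division).
    static long heuristic(long rows) {
        return rows / 3;
    }

    // Current behaviour per the description: both comparisons divide by 3.
    static long currentEstimate(long numRows) {
        long afterGreaterThan = heuristic(numRows);   // s_number_employees > 200
        return heuristic(afterGreaterThan);           // s_number_employees < 295
    }

    // Behaviour if the first comparison kept all rows because min (200) >= 200.
    // Assumption: the second comparison still falls back to the heuristic.
    static long idealFirstFilterEstimate(long numRows) {
        long afterGreaterThan = numRows;              // all rows kept
        return heuristic(afterGreaterThan);
    }

    public static void main(String[] args) {
        System.out.println(currentEstimate(1002));          // prints 111
        System.out.println(idealFirstFilterEstimate(1002)); // prints 334
    }
}
```

Under these assumptions the row estimate feeding the reducer count differs by a factor of three.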

      In TPC-DS Q79, this underestimate causes too few reduce tasks to be spun up, which delays the query.

      Attachments

        1. HIVE-16290.1.patch
          2 kB
          Rajesh Balamohan
        2. HIVE-16290.2.patch
          11 kB
          Gopal Vijayaraghavan

          People

            Assignee: Rajesh Balamohan (rajesh.balamohan)
            Reporter: Rajesh Balamohan (rajesh.balamohan)