Hive / HIVE-23788

FilterStatsRule misestimate causes hashtable computation to rehash often


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Depending on the available statistics, FilterStatsRule sometimes estimates the output row count as numRows/3. This projects a lower keyCount for the hashtable computation, so the hashtable is undersized and has to rehash often.

      https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L952

      https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1192

      For example, in TPC-DS Q74 at 10 TB, evaluating the conditions "t_s_firstyear.year_total > 0, t_w_secyear.year_total / t_w_firstyear.year_total, t_s_secyear.year_total / t_s_firstyear.year_total" projects one third of the rows, which causes the hashtable in the downstream vertex to rehash.

      We may need to check whether statistics can be projected correctly for these columns.
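
      The effect described above can be illustrated with a small sketch. This is not Hive's hashtable code: it assumes a simplified table that starts at the smallest power-of-two capacity able to hold the estimated key count at a given load factor and doubles whenever that load factor is exceeded. The class name RehashSketch, the method countRehashes, the 0.75 load factor, and the key counts are all hypothetical and chosen only for illustration.

```java
public class RehashSketch {

    /**
     * Counts how many doublings a simplified hash table needs when it is
     * sized from estimatedKeys but actually receives actualKeys.
     * Assumption: power-of-two capacities and resize-by-doubling; this is
     * a model for illustration, not Hive's implementation.
     */
    public static int countRehashes(long estimatedKeys, long actualKeys, double loadFactor) {
        // Initial capacity: smallest power of two that fits the estimate.
        long capacity = 1;
        while (capacity * loadFactor < estimatedKeys) {
            capacity <<= 1;
        }
        // Each time the real key count exceeds the threshold, double and rehash.
        int rehashes = 0;
        while (capacity * loadFactor < actualKeys) {
            capacity <<= 1;
            rehashes++;
        }
        return rehashes;
    }

    public static void main(String[] args) {
        long actualKeys = 30_000_000L;          // hypothetical true key count
        long fallbackEstimate = actualKeys / 3; // the numRows/3 style estimate
        double loadFactor = 0.75;               // assumed load factor

        System.out.println("rehashes with accurate estimate: "
                + countRehashes(actualKeys, actualKeys, loadFactor));
        System.out.println("rehashes with numRows/3 estimate: "
                + countRehashes(fallbackEstimate, actualKeys, loadFactor));
    }
}
```

      Under these assumed numbers the model needs no doubling when sized from the true key count and two doublings when sized from the numRows/3 estimate. Each doubling re-inserts every key loaded so far, which is the cost this issue is concerned with.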

       

          People

            Assignee: Unassigned
            Reporter: Rajesh Balamohan (rajesh.balamohan)