Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-20660

Group by statistics estimation could be improved by bounding the total number of rows to source table

Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 4.0.0
    • 4.0.0-alpha-1
    • Statistics
    • None

    Description

      Currently the stats for group by is estimated by taking product of NDVs of all the keys and bounding it by the number of rows of its input. This bound could be improved by using the source table instead of immediate input, the insight in this case is that cardinality/ndvs of a table can not go beyond the original (outer joins will only add NULLs thereby increasing the cardinality by 1).

      Note that the assumption here is that group by keys all belong to the same source table/input.
      This will improve the estimation in situations where group by is executed after joins wherein Hive could end up estimating the number of rows.

      Reproducer

      set hive.stats.fetch.column.stats=true;
      
      create table t1(i int, j int);
      alter table t1 update statistics set('numRows'='10000', 'rawDataSize'='18000');
      alter table t1 update statistics for column i set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
      alter table t1 update statistics for column j set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
      
      create table t2(i2 int, j2 int);
      alter table t2 update statistics set('numRows'='100000000', 'rawDataSize'='10000');
      alter table t2 update statistics for column i2 set('numDVs'='10000000','numNulls'='0','highValue'='8000','lowValue'='0');
      alter table t2 update statistics for column j2 set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
      
      explain select count (1) from t1,t2
      	where t1.j=t2.i2 
      group by t1.i, t1.j;
      
      Reducer 2
                  Reduce Operator Tree:
                    Merge Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 _col1 (type: int)
                        1 _col0 (type: int)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 99700 Data size: 797288 Basic stats: COMPLETE Column stats: COMPLETE
                      Group By Operator
                        aggregations: count()
                        keys: _col0 (type: int), _col1 (type: int)
                        mode: hash
                        outputColumnNames: _col0, _col1, _col2
                        Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE <==========
                        Reduce Output Operator
                          key expressions: _col0 (type: int), _col1 (type: int)
                          sort order: ++
                          Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
                          Statistics: Num rows: 49850 Data size: 797448 Basic stats: COMPLETE Column stats: COMPLETE
                          value expressions: _col2 (type: bigint)
        .....................
      

      Attachments

        1. HIVE-20660.1.patch
          19 kB
          Vineet Garg
        2. HIVE-20660.2.patch
          17 kB
          Vineet Garg
        3. HIVE-20660.3.patch
          18 kB
          Vineet Garg
        4. HIVE-20660.4.patch
          18 kB
          Vineet Garg
        5. HIVE-20660.5.patch
          18 kB
          Vineet Garg
        6. HIVE-20660.6.patch
          18 kB
          Vineet Garg
        7. HIVE-20660.7.patch
          18 kB
          Vineet Garg

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            vgarg Vineet Garg Assign to me
            vgarg Vineet Garg
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment