Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24764

insert overwrite on a partition resets row count stats in other partitions

    XMLWordPrintableJSON

Details

    Description

      After insert overwrite on a partition, stats on other partitions are messed up. Subsequent queries end up with plans with PARTIAL stats. In certain cases, this leads to suboptimal query plans.

      drop table if exists test_stats;
      drop table if exists test_stats_2;
      
      create table test_stats(i int, j bigint);
      create table test_stats_2(i int) partitioned by (j bigint);
      
      insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
      
      -- select * from test_stats;
      1   1
      2   2
      3   3
      4   4
      5   <null>
      
      
      insert overwrite table test_stats_2 partition(j)  select i, j from test_stats where j is not null;
      
      
      -- After executing this statement, stat gets messed up.
      insert overwrite table test_stats_2 partition(j)  select i, j from test_stats where j is null;
      
      -- select * from test_stats_2;
      1   1
      2   2
      3   3
      4   4
      5   <null>
      
      
      -- This would return "PARTIAL" stats instead of "COMPLETE"
      explain select i, count(*) as c from test_stats_2 group by i order by c desc limit 10;
      
      
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
              Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
            DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: test_stats_2
                        Statistics: Num rows: 125 Data size: 500 Basic stats: PARTIAL Column stats: COMPLETE
                        Select Operator
                          expressions: i (type: int)
                          outputColumnNames: i
                          Statistics: Num rows: 125 Data size: 500 Basic stats: PARTIAL Column stats: COMPLETE
                          Group By Operator
                            aggregations: count()
                            keys: i (type: int)
                            minReductionHashAggr: 0.99
                            mode: hash
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 125 Data size: 1500 Basic stats: PARTIAL Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              null sort order: a
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 125 Data size: 1500 Basic stats: PARTIAL Column stats: COMPLETE
                              value expressions: _col1 (type: bigint)
                  Execution mode: vectorized, llap
                  LLAP IO: may be used (ACID table)
              Reducer 2 
                  Execution mode: vectorized, llap
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: count(VALUE._col0)
                      keys: KEY._col0 (type: int)
                      mode: mergepartial
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL Column stats: COMPLETE
                      Top N Key Operator
                        sort order: -
                        keys: _col1 (type: bigint)
                        null sort order: a
                        Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL Column stats: COMPLETE
                        top n: 10
                        Reduce Output Operator
                          key expressions: _col1 (type: bigint)
                          null sort order: a
                          sort order: -
                          Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL Column stats: COMPLETE
                          TopN Hash Memory Usage: 0.04
                          value expressions: _col0 (type: int)
              Reducer 3 
                  Execution mode: vectorized, llap
                  Reduce Operator Tree:
                    Select Operator
                      expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 (type: bigint)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL Column stats: COMPLETE
                      Limit
                        Number of rows: 10
                        Statistics: Num rows: 10 Data size: 120 Basic stats: PARTIAL Column stats: COMPLETE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 10 Data size: 120 Basic stats: PARTIAL Column stats: COMPLETE
                          table:
                              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: 10
            Processor Tree:
              ListSink
      

      https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsAggregator.java#L138

      FSStats should return "null" or "" if partitions are not present in its list.

      Attachments

        Issue Links

          Activity

            People

              rajesh.balamohan Rajesh Balamohan
              rajesh.balamohan Rajesh Balamohan
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 40m
                  40m