Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-23954

count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

    XMLWordPrintableJSON

Details

    Description

      select count(*), count(distinct mid) from db1.table1 where partitioned_column = '...'

       

      is not working properly when hive.optimize.countdistinct is true. By default, it's true for all 3.x versions.

      In the two plans below, the aggregations part in the Output of Group By Operator of Map 1 are different.

       

      - hive.optimize.countdistinct=false

      +----------------------------------------------------+
      |                      Explain                       |
      +----------------------------------------------------+
      | Plan optimized by CBO.                             |
      |                                                    |
      | Vertex dependency in root stage                    |
      | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
      |                                                    |
      | Stage-0                                            |
      |   Fetch Operator                                   |
      |     limit:-1                                       |
      |     Stage-1                                        |
      |       Reducer 2                                    |
      |       File Output Operator [FS_7]                  |
      |         Group By Operator [GBY_5] (rows=1 width=24) |
      |           Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"] |
      |         <-Map 1 [SIMPLE_EDGE]                      |
      |           SHUFFLE [RS_4]                           |
      |             Group By Operator [GBY_3] (rows=343640771 width=4160) |
      |               Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid |
      |               Select Operator [SEL_2] (rows=343640771 width=4160) |
      |                 Output:["mid"]                     |
      |                 TableScan [TS_0] (rows=343640771 width=4160) |
      |                   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
      |                                                    |
      +----------------------------------------------------+

       

      - hive.optimize.countdistinct=true

      +----------------------------------------------------+
      |                      Explain                       |
      +----------------------------------------------------+
      | Plan optimized by CBO.                             |
      |                                                    |
      | Vertex dependency in root stage                    |
      | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
      |                                                    |
      | Stage-0                                            |
      |   Fetch Operator                                   |
      |     limit:-1                                       |
      |     Stage-1                                        |
      |       Reducer 2                                    |
      |       File Output Operator [FS_7]                  |
      |         Group By Operator [GBY_14] (rows=1 width=16) |
      |           Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
      |           Group By Operator [GBY_11] (rows=343640771 width=4160) |
      |             Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
      |           <-Map 1 [SIMPLE_EDGE]                    |
      |             SHUFFLE [RS_10]                        |
      |               PartitionCols:_col0                  |
      |               Group By Operator [GBY_9] (rows=343640771 width=4160) |
      |                 Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
      |                 Select Operator [SEL_2] (rows=343640771 width=4160) |
      |                   Output:["mid"]                   |
      |                   TableScan [TS_0] (rows=343640771 width=4160) |
      |                     db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
      |                                                    |
      +----------------------------------------------------+
      

      Attachments

        1. HIVE-23954.01.patch
          3 kB
          Eugene Chung
        2. HIVE-23954.01.patch
          3 kB
          Eugene Chung

        Issue Links

          Activity

            People

              euigeun_chung Eugene Chung
              euigeun_chung Eugene Chung
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 40m
                  40m