Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-12165

wrong result when hive.optimize.sampling.orderby=true with some aggregate functions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      hortonworks 2.3

      Description

      This simple query give wrong result , when , i use the parallel order .

      select count(*) , count(distinct dummyint ) , min(dummyint),max(dummyint) from foobar_1M ;
      

      Current wrong result :

      c0	c1	c2	c3
      32740	32740	0	163695
      113172	113172	163700	729555
      54088	54088	729560	999995
      

      Right result :

      c0	c1	c2	c3
      1000000	1000000	0	999999
      

      The sql script for my test

      drop table foobar_1 ;
      create table foobar_1 ( dummyint int  , dummystr string ) ;
      insert into table foobar_1 select count(*),'dummy 0'  from foobar_1 ;
      
      drop table foobar_1M ;
      create table foobar_1M ( dummyint bigint  , dummystr string ) ;
      
      insert overwrite table foobar_1M
             select val_int  , concat('dummy ',val_int) from
                   ( select ((((((d_1*10)+d_2)*10+d_3)*10+d_4)*10+d_5)*10+d_6) as val_int from foobar_1
                       lateral view outer explode(split("0,1,2,3,4,5,6,7,8,9",",")) tbl_1 as d_1
                       lateral view outer explode(split("0,1,2,3,4,5,6,7,8,9",",")) tbl_2 as d_2
                       lateral view outer explode(split("0,1,2,3,4,5,6,7,8,9",",")) tbl_3 as d_3
                       lateral view outer explode(split("0,1,2,3,4,5,6,7,8,9",",")) tbl_4 as d_4
                       lateral view outer explode(split("0,1,2,3,4,5,6,7,8,9",",")) tbl_5 as d_5
                       lateral view outer explode(split("0,1,2,3,4,5,6,7,8,9",",")) tbl_6 as d_6  ) as f                 ;
      
      
      set hive.optimize.sampling.orderby.number=10000;
      set hive.optimize.sampling.orderby.percent=0.1f;
      set mapreduce.job.reduces=3 ;
      
      set hive.optimize.sampling.orderby=false;
      
      select count(*) , count(distinct dummyint ) , min(dummyint),max(dummyint) from foobar_1M ;
      
      set hive.optimize.sampling.orderby=true;
      
      select count(*) , count(distinct dummyint ) , min(dummyint),max(dummyint) from foobar_1M ;
      

        Attachments

          Activity

            People

            • Assignee:
              aihuaxu Aihua Xu
              Reporter:
              ErwanMAS ErwanMAS
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: