Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24595

Vectorization causing incorrect results for scalar subquery

    XMLWordPrintableJSON

Details

    Description

      Repro

       CREATE EXTERNAL TABLE `alltypessmall`(             
         `id` int,                                        
         `bool_col` boolean,                              
         `tinyint_col` tinyint,                           
         `smallint_col` smallint,                         
         `int_col` int,                                   
         `bigint_col` bigint,                             
         `float_col` float,                               
         `double_col` double,                             
         `date_string_col` string,                        
         `string_col` string,                             
         `timestamp_col` timestamp)                       
       PARTITIONED BY (                                   
         `year` int,                                      
         `month` int)                                     
       ROW FORMAT SERDE                                   
         'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  
       WITH SERDEPROPERTIES (                             
         'escape.delim'='\\',                             
         'field.delim'=',',                               
         'serialization.format'=',')                      
       STORED AS INPUTFORMAT                              
         'org.apache.hadoop.mapred.TextInputFormat'       
       OUTPUTFORMAT                                       
         'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
       TBLPROPERTIES (                                    
         'DO_NOT_UPDATE_STATS'='true',                    
         'OBJCAPABILITIES'='EXTREAD,EXTWRITE',            
         'STATS_GENERATED'='TASK',                        
         'impala.lastComputeStatsTime'='1608312793',      
         'transient_lastDdlTime'='1608310442')            ;
      
      insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001');
      insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001');
      insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,40,3434,5.4,44.3,'str1','str2', '01-01-2001');
      

      Following query should fail but it succeeds

      SELECT id FROM alltypessmall
      WHERE int_col =
        (SELECT int_col
         FROM alltypessmall)
      ORDER BY id;
      

      Explain plan

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            DagId: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17
            Edges:
              Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
              Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
            DagName: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17
            Vertices:
              Map 1
                  Map Operator Tree:
                      TableScan
                        alias: alltypessmall
                        filterExpr: int_col is not null (type: boolean)
                        Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: int_col is not null (type: boolean)
                          Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: id (type: int), int_col (type: int)
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                            Map Join Operator
                              condition map:
                                   Inner Join 0 to 1
                              keys:
                                0
                                1
                              outputColumnNames: _col0, _col1
                              input vertices:
                                1 Reducer 4
                              Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                              Map Join Operator
                                condition map:
                                     Inner Join 0 to 1
                                keys:
                                  0 _col1 (type: int)
                                  1 _col0 (type: int)
                                outputColumnNames: _col0
                                input vertices:
                                  1 Map 3
                                Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
                                Reduce Output Operator
                                  key expressions: _col0 (type: int)
                                  null sort order: z
                                  sort order: +
                                  Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Map 3
                  Map Operator Tree:
                      TableScan
                        alias: alltypessmall
                        Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE
                        Select Operator
                          Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE
                          Group By Operator
                            aggregations: count()
                            minReductionHashAggr: 0.6666666
                            mode: hash
                            outputColumnNames: _col0
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              null sort order:
                              sort order:
                              Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col0 (type: bigint)
                        Filter Operator
                          predicate: int_col is not null (type: boolean)
                          Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: int_col (type: int)
                            outputColumnNames: _col0
                            Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int)
                              null sort order: z
                              sort order: +
                              Map-reduce partition columns: _col0 (type: int)
                              Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
                  Execution mode: vectorized
              Reducer 2
                  Execution mode: vectorized
                  Reduce Operator Tree:
                    Select Operator
                      expressions: KEY.reducesinkkey0 (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              Reducer 4
                  Execution mode: vectorized
                  Reduce Operator Tree:
                    Group By Operator
                      aggregations: count(VALUE._col0)
                      mode: mergepartial
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                      Filter Operator
                        predicate: (sq_count_check(_col0) <= 1) (type: boolean)
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                        Select Operator
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            null sort order:
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      

      Issue is that Map 3/Reducer 4 group by is producing incorrect count. Expected output in this case is 3 but actual seems to be 1.

       

      CBO plan

      HiveSortLimit(sort0=[$0], dir0=[ASC])
        HiveProject(id=[$0])
          HiveJoin(condition=[=($1, $3)], joinType=[inner], algorithm=[none], cost=[{6.0 rows, 0.0 cpu, 0.0 io}])
            HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[{4.0 rows, 0.0 cpu, 0.0 io}])
              HiveProject(id=[$0], int_col=[$4])
                HiveFilter(condition=[IS NOT NULL($4)])
                  HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
              HiveProject(cnt=[$0])
                HiveFilter(condition=[<=(sq_count_check($0), 1)])
                  HiveProject(cnt=[$0])
                    HiveAggregate(group=[{}], cnt=[COUNT()])
                      HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
            HiveProject(int_col=[$4])
              HiveFilter(condition=[IS NOT NULL($4)])
                HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
      

       

      Attachments

        Issue Links

          Activity

            People

              mustafaiman Mustafa İman
              vgarg Vineet Garg
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 0.5h
                  0.5h