Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-10841

[WHERE col is not null] does not work sometimes for queries with many JOIN statements

    XMLWordPrintableJSON

Details

    Description

      The result from the following SELECT query is 3 rows but it should be 1 row.
      I checked it in MySQL - it returned 1 row.

      To reproduce the issue in Hive
      1. prepare tables

      drop table if exists L;
      drop table if exists LA;
      drop table if exists FR;
      drop table if exists A;
      drop table if exists PI;
      drop table if exists acct;
      
      create table L as select 4436 id;
      create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
      create table FR as select 4436 loan_id;
      create table A as select 4748 id;
      create table PI as select 4415 id;
      
      create table acct as select 4748 aid, 10 acc_n, 122 brn;
      insert into table acct values(4748, null, null);
      insert into table acct values(4748, null, null);
      

      2. run SELECT query

      select
        acct.ACC_N,
        acct.brn
      FROM L
      JOIN LA ON L.id = LA.loan_id
      JOIN FR ON L.id = FR.loan_id
      JOIN A ON LA.aid = A.id
      JOIN PI ON PI.id = LA.pi_id
      JOIN acct ON A.id = acct.aid
      WHERE
        L.id = 4436
        and acct.brn is not null;
      

      the result is 3 rows

      10	122
      NULL	NULL
      NULL	NULL
      

      but it should be 1 row

      10	122
      

      2.1 "explain select ..." output for hive-1.3.0 MR

      STAGE DEPENDENCIES:
        Stage-12 is a root stage
        Stage-9 depends on stages: Stage-12
        Stage-0 depends on stages: Stage-9
      
      STAGE PLANS:
        Stage: Stage-12
          Map Reduce Local Work
            Alias -> Map Local Tables:
              a 
                Fetch Operator
                  limit: -1
              acct 
                Fetch Operator
                  limit: -1
              fr 
                Fetch Operator
                  limit: -1
              l 
                Fetch Operator
                  limit: -1
              pi 
                Fetch Operator
                  limit: -1
            Alias -> Map Local Operator Tree:
              a 
                TableScan
                  alias: a
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 _col5 (type: int)
                        1 id (type: int)
                        2 aid (type: int)
              acct 
                TableScan
                  alias: acct
                  Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: aid is not null (type: boolean)
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 _col5 (type: int)
                        1 id (type: int)
                        2 aid (type: int)
              fr 
                TableScan
                  alias: fr
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (loan_id = 4436) (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 4436 (type: int)
                        1 4436 (type: int)
                        2 4436 (type: int)
              l 
                TableScan
                  alias: l
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (id = 4436) (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 4436 (type: int)
                        1 4436 (type: int)
                        2 4436 (type: int)
              pi 
                TableScan
                  alias: pi
                  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                    HashTable Sink Operator
                      keys:
                        0 _col6 (type: int)
                        1 id (type: int)
      
        Stage: Stage-9
          Map Reduce
            Map Operator Tree:
                TableScan
                  alias: la
                  Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (((loan_id is not null and aid is not null) and pi_id is not null) and (loan_id = 4436)) (type: boolean)
                    Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 0 to 2
                      keys:
                        0 4436 (type: int)
                        1 4436 (type: int)
                        2 4436 (type: int)
                      outputColumnNames: _col5, _col6
                      Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                             Inner Join 1 to 2
                        keys:
                          0 _col5 (type: int)
                          1 id (type: int)
                          2 aid (type: int)
                        outputColumnNames: _col6, _col19, _col20
                        Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col6 (type: int)
                            1 id (type: int)
                          outputColumnNames: _col19, _col20
                          Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col19 (type: int), _col20 (type: int)
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
                            File Output Operator
                              compressed: false
                              Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column stats: NONE
                              table:
                                  input format: org.apache.hadoop.mapred.TextInputFormat
                                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
            Processor Tree:
              ListSink
      
      Time taken: 0.57 seconds, Fetched: 142 row(s)
      

      2.2. "explain select..." output for hive-0.13.1 Tez

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE)
              Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)
              Reducer 6 <- Map 5 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 8 (SIMPLE_EDGE)
            DagName: lcapp_20150528111717_06c57a5b-8dc6-4ce9-bce7-b9e0a7818fe4:1
            Vertices:
              Map 1 
                  Map Operator Tree:
                      TableScan
                        alias: acct
                        Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: aid (type: int)
                          sort order: +
                          Map-reduce partition columns: aid (type: int)
                          Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
                          value expressions: acc_n (type: int), brn (type: int)
              Map 4 
                  Map Operator Tree:
                      TableScan
                        alias: a
                        Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: id (type: int)
                          sort order: +
                          Map-reduce partition columns: id (type: int)
                          Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
              Map 5 
                  Map Operator Tree:
                      TableScan
                        alias: la
                        Statistics: Num rows: 28 Data size: 347 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator
                          predicate: (loan_id = 4436) (type: boolean)
                          Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: loan_id (type: int)
                            sort order: +
                            Map-reduce partition columns: loan_id (type: int)
                            Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column stats: NONE
                            value expressions: aid (type: int), pi_id (type: int)
              Map 7 
                  Map Operator Tree:
                      TableScan
                        alias: fr
                        Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator
                          predicate: (loan_id = 4436) (type: boolean)
                          Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: loan_id (type: int)
                            sort order: +
                            Map-reduce partition columns: loan_id (type: int)
                            Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Map 8 
                  Map Operator Tree:
                      TableScan
                        alias: l
                        Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
                        Filter Operator
                          predicate: (id = 4436) (type: boolean)
                          Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
                          Reduce Output Operator
                            key expressions: id (type: int)
                            sort order: +
                            Map-reduce partition columns: id (type: int)
                            Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column stats: NONE
              Map 9 
                  Map Operator Tree:
                      TableScan
                        alias: pi
                        Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: id (type: int)
                          sort order: +
                          Map-reduce partition columns: id (type: int)
                          Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column stats: NONE
              Reducer 2 
                  Reduce Operator Tree:
                    Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 1 to 2
                      condition expressions:
                        0 {VALUE._col2}
                        1 
                        2 {VALUE._col1} {VALUE._col2}
                      outputColumnNames: _col2, _col15, _col16
                      Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col2 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col2 (type: int)
                        Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col15 (type: int), _col16 (type: int)
              Reducer 3 
                  Reduce Operator Tree:
                    Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {VALUE._col1} {VALUE._col2}
                        1 
                      outputColumnNames: _col1, _col2
                      Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col1 (type: int), _col2 (type: int)
                        outputColumnNames: _col0, _col1
                        Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              Reducer 6 
                  Reduce Operator Tree:
                    Join Operator
                      condition map:
                           Inner Join 0 to 1
                           Inner Join 0 to 2
                      condition expressions:
                        0 
                        1 {VALUE._col1} {VALUE._col2}
                        2 
                      outputColumnNames: _col4, _col5
                      Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col4 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col4 (type: int)
                        Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col5 (type: int)
      
        Stage: Stage-0
          Fetch Operator
            limit: -1
      
      Time taken: 1.377 seconds, Fetched: 146 row(s)
      

      3. The workaround is to put "acct.brn is not null" to join condition

      select
        acct.ACC_N,
        acct.brn
      FROM L
      JOIN LA ON L.id = LA.loan_id
      JOIN FR ON L.id = FR.loan_id
      JOIN A ON LA.aid = A.id
      JOIN PI ON PI.id = LA.pi_id
      JOIN acct ON A.id = acct.aid and acct.brn is not null
      WHERE
        L.id = 4436;
      
      OK
      10	122
      Time taken: 23.479 seconds, Fetched: 1 row(s)
      

      I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations have the issue

      Attachments

        1. HIVE-10841.03.patch
          16 kB
          jcamachorodriguez
        2. HIVE-10841.1.patch
          3 kB
          Laljo John Pullokkaran
        3. HIVE-10841.2.patch
          2 kB
          Laljo John Pullokkaran
        4. HIVE-10841.patch
          0.7 kB
          Laljo John Pullokkaran

        Issue Links

          Activity

            People

              jpullokkaran Laljo John Pullokkaran
              apivovarov Alexander Pivovarov
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: