Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-1102

Query with self joined table may produce incorrect results

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • Impala 2.0
    • Impala 2.0
    • None

    Description

      This one is tricky and I dont really understand what's causing this but the results are wrong and its reproducible.

      To reproduce, create the two tables below:

      [localhost:21000] > create table foo as select 1 int_col, true bool_col union select 1, false union select 1, null;
      Query: create table foo as select 1 int_col, true bool_col union select 1, false union select 1, null
      +-------------------+
      | summary           |
      +-------------------+
      | Inserted 3 row(s) |
      +-------------------+
      Returned 1 row(s) in 0.87s
      [localhost:21000] > select * from foo;
      Query: select * from foo
      +---------+----------+
      | int_col | bool_col |
      +---------+----------+
      | 1       | false    |
      | 1       | true     |
      | 1       | NULL     |
      +---------+----------+
      Returned 3 row(s) in 0.64s
      
      
      [localhost:21000] > create table bar as select * from alltypes union all select * from alltypes;
      Query: create table bar as select * from alltypes union all select * from alltypes
      +-----------------------+
      | summary               |
      +-----------------------+
      | Inserted 14600 row(s) |
      +-----------------------+
      Returned 1 row(s) in 1.35s
      

      Now you can see incorrect results:

      [localhost:21000] > select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join bar t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10;
      Query: select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join bar t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10
      +----------+----------+----------+
      | bool_col | bool_col | bool_col |
      +----------+----------+----------+
      | false    | true     | true     |
      | false    | true     | true     |
      | false    | true     | true     |
      | false    | true     | true     |
      | false    | true     | true     |
      | false    | true     | true     |
      | false    | true     | true     |
      | false    | true     | true     |
      | false    | true     | true     |
      | false    | true     | true     |
      +----------+----------+----------+
      Returned 10 row(s) in 1.09s
      
      
      [localhost:21000] > explain select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join bar t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10;
      Query: explain select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join bar t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10
      +------------------------------------------------------------------------------------+
      | Explain String                                                                     |
      +------------------------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=4.06GB VCores=3                            |
      | WARNING: The following tables are missing relevant table and/or column statistics. |
      | functional.bar, functional.foo                                                     |
      |                                                                                    |
      | 07:EXCHANGE [UNPARTITIONED]                                                        |
      | |  limit: 10                                                                       |
      | |                                                                                  |
      | 04:HASH JOIN [INNER JOIN, BROADCAST]                                               |
      | |  hash predicates: t2.bool_col = t3.bool_col                                      |
      | |  limit: 10                                                                       |
      | |                                                                                  |
      | |--06:EXCHANGE [BROADCAST]                                                         |
      | |  |                                                                               |
      | |  02:SCAN HDFS [functional.bar t3]                                                |
      | |     partitions=1/1 size=1.25MB                                                   |
      | |                                                                                  |
      | 03:HASH JOIN [INNER JOIN, BROADCAST]                                               |
      | |  hash predicates: t1.int_col = t2.int_col                                        |
      | |                                                                                  |
      | |--05:EXCHANGE [BROADCAST]                                                         |
      | |  |                                                                               |
      | |  01:SCAN HDFS [functional.foo t2]                                                |
      | |     partitions=1/1 size=20B                                                      |
      | |                                                                                  |
      | 00:SCAN HDFS [functional.foo t1]                                                   |
      |    partitions=1/1 size=20B                                                         |
      +------------------------------------------------------------------------------------+
      Returned 26 row(s) in 0.01s
      

      but if the query in run against the non-union version of "bar", ie "alltypes", then the results are correct.

      [localhost:21000] > select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join alltypes t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10;
      Query: select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join alltypes t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10
      +----------+----------+----------+
      | bool_col | bool_col | bool_col |
      +----------+----------+----------+
      | true     | true     | true     |
      | false    | false    | false    |
      | true     | true     | true     |
      | false    | false    | false    |
      | true     | true     | true     |
      | false    | false    | false    |
      | true     | true     | true     |
      | false    | false    | false    |
      | true     | true     | true     |
      | false    | false    | false    |
      +----------+----------+----------+
      Returned 10 row(s) in 1.18s
      
      
      [localhost:21000] > explain select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join alltypes t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10;
      Query: explain select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join alltypes t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10
      +------------------------------------------------------------------------------------+
      | Explain String                                                                     |
      +------------------------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=4.16GB VCores=3                            |
      | WARNING: The following tables are missing relevant table and/or column statistics. |
      | functional.foo                                                                     |
      |                                                                                    |
      | 07:EXCHANGE [UNPARTITIONED]                                                        |
      | |  limit: 10                                                                       |
      | |                                                                                  |
      | 04:HASH JOIN [INNER JOIN, BROADCAST]                                               |
      | |  hash predicates: t1.int_col = t2.int_col, t3.bool_col = t2.bool_col             |
      | |  limit: 10                                                                       |
      | |                                                                                  |
      | |--06:EXCHANGE [BROADCAST]                                                         |
      | |  |                                                                               |
      | |  01:SCAN HDFS [functional.foo t2]                                                |
      | |     partitions=1/1 size=20B                                                      |
      | |                                                                                  |
      | 03:HASH JOIN [INNER JOIN, BROADCAST]                                               |
      | |  hash predicates: t3.bool_col = t1.bool_col                                      |
      | |                                                                                  |
      | |--05:EXCHANGE [BROADCAST]                                                         |
      | |  |                                                                               |
      | |  00:SCAN HDFS [functional.foo t1]                                                |
      | |     partitions=1/1 size=20B                                                      |
      | |                                                                                  |
      | 02:SCAN HDFS [functional.alltypes t3]                                              |
      |    partitions=24/24 size=478.45KB                                                  |
      +------------------------------------------------------------------------------------+
      Returned 26 row(s) in 0.01s
      

      So I guess the problem has something to do with the table sizes involved.

      Attachments

        Activity

          People

            alex.behm Alexander Behm
            caseyc casey
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: