Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
Impala 2.0
-
None
Description
This one is tricky and I dont really understand what's causing this but the results are wrong and its reproducible.
To reproduce, create the two tables below:
[localhost:21000] > create table foo as select 1 int_col, true bool_col union select 1, false union select 1, null; Query: create table foo as select 1 int_col, true bool_col union select 1, false union select 1, null +-------------------+ | summary | +-------------------+ | Inserted 3 row(s) | +-------------------+ Returned 1 row(s) in 0.87s [localhost:21000] > select * from foo; Query: select * from foo +---------+----------+ | int_col | bool_col | +---------+----------+ | 1 | false | | 1 | true | | 1 | NULL | +---------+----------+ Returned 3 row(s) in 0.64s [localhost:21000] > create table bar as select * from alltypes union all select * from alltypes; Query: create table bar as select * from alltypes union all select * from alltypes +-----------------------+ | summary | +-----------------------+ | Inserted 14600 row(s) | +-----------------------+ Returned 1 row(s) in 1.35s
Now you can see incorrect results:
[localhost:21000] > select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join bar t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10; Query: select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join bar t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10 +----------+----------+----------+ | bool_col | bool_col | bool_col | +----------+----------+----------+ | false | true | true | | false | true | true | | false | true | true | | false | true | true | | false | true | true | | false | true | true | | false | true | true | | false | true | true | | false | true | true | | false | true | true | +----------+----------+----------+ Returned 10 row(s) in 1.09s [localhost:21000] > explain select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join bar t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10; Query: explain select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join bar t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10 +------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=4.06GB VCores=3 | | WARNING: The following tables are missing relevant table and/or column statistics. | | functional.bar, functional.foo | | | | 07:EXCHANGE [UNPARTITIONED] | | | limit: 10 | | | | | 04:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: t2.bool_col = t3.bool_col | | | limit: 10 | | | | | |--06:EXCHANGE [BROADCAST] | | | | | | | 02:SCAN HDFS [functional.bar t3] | | | partitions=1/1 size=1.25MB | | | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: t1.int_col = t2.int_col | | | | | |--05:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [functional.foo t2] | | | partitions=1/1 size=20B | | | | | 00:SCAN HDFS [functional.foo t1] | | partitions=1/1 size=20B | +------------------------------------------------------------------------------------+ Returned 26 row(s) in 0.01s
but if the query in run against the non-union version of "bar", ie "alltypes", then the results are correct.
[localhost:21000] > select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join alltypes t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10; Query: select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join alltypes t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10 +----------+----------+----------+ | bool_col | bool_col | bool_col | +----------+----------+----------+ | true | true | true | | false | false | false | | true | true | true | | false | false | false | | true | true | true | | false | false | false | | true | true | true | | false | false | false | | true | true | true | | false | false | false | +----------+----------+----------+ Returned 10 row(s) in 1.18s [localhost:21000] > explain select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join alltypes t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10; Query: explain select t1.bool_col, t2.bool_col, t3.bool_col from foo t1 join foo t2 on t2.int_col = t1.int_col join alltypes t3 on t3.bool_col = t2.bool_col and t3.bool_col = t1.bool_col limit 10 +------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=4.16GB VCores=3 | | WARNING: The following tables are missing relevant table and/or column statistics. | | functional.foo | | | | 07:EXCHANGE [UNPARTITIONED] | | | limit: 10 | | | | | 04:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: t1.int_col = t2.int_col, t3.bool_col = t2.bool_col | | | limit: 10 | | | | | |--06:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [functional.foo t2] | | | partitions=1/1 size=20B | | | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: t3.bool_col = t1.bool_col | | | | | |--05:EXCHANGE [BROADCAST] | | | | | | | 00:SCAN HDFS [functional.foo t1] | | | partitions=1/1 size=20B | | | | | 02:SCAN HDFS [functional.alltypes t3] | | partitions=24/24 size=478.45KB | +------------------------------------------------------------------------------------+ Returned 26 row(s) in 0.01s
So I guess the problem has something to do with the table sizes involved.