Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
Impala 2.2
-
None
Description
We have run into a situation where nested views will cause results to be inconsistent with the results of running the queries in the views directly.
To reproduce...
In Impala - Create tables and some data
> use u_alanj; > create table names(id string, name string); > insert into names(id, name) values("A", "Customer A"); > insert into names(id, name) values("B", "Customer B"); > create table vals(id string, value integer);
In Bash - Create more data
$ for x in A B C; do for y in 1 2 3; do echo "insert into vals(id, value) values('$x', $y);"; done; done > createData.sql $ impala-shell -f createData.sql
In Impala - Create views and demonstrate bad query
> use u_alanj; > create view test_join as select v.id, v.value, n.name from u_alanj.vals v left outer join u_alanj.names n on v.id = n.id; > create view test_join2 as select j.name, sum(j.value) as total from test_join j group by j.name; > select * from test_join2 where name='Customer A'; Query: select * from test_join2 where name='Customer A' +------------+-------+ | name | total | +------------+-------+ | Customer A | 6 | | NULL | 12 | +------------+-------+ Fetched 2 row(s) in 1.30s > select j.name, sum(j.value) as total from test_join j where j.name = 'Customer A' group by j.name; Query: select j.name, sum(j.value) as total from test_join j where j.name = 'Customer A' group by j.name +------------+-------+ | name | total | +------------+-------+ | Customer A | 6 | +------------+-------+ Fetched 1 row(s) in 0.15s
In the version using the two views, we get null results.
Broken Explain Results
+------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=2.16GB VCores=2 | | WARNING: The following tables are missing relevant table and/or column statistics. | | u_alanj.names, u_alanj.vals | | | | 07:EXCHANGE [UNPARTITIONED] | | | | | 06:AGGREGATE [FINALIZE] | | | output: sum:merge(j.value) | | | group by: j.name | | | | | 05:EXCHANGE [HASH(j.name)] | | | | | 03:AGGREGATE | | | output: sum(v.value) | | | group by: n.name | | | | | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST] | | | hash predicates: v.id = n.id | | | | | |--04:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [u_alanj.names n] | | | partitions=1/1 files=2 size=26B | | | predicates: n.name = 'Customer A' | | | | | 00:SCAN HDFS [u_alanj.vals v] | | partitions=1/1 files=9 size=36B | +------------------------------------------------------------------------------------+ Fetched 27 row(s) in 0.05s
Working Explain Results
+------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=2.16GB VCores=2 | | WARNING: The following tables are missing relevant table and/or column statistics. | | u_alanj.names, u_alanj.vals | | | | 07:EXCHANGE [UNPARTITIONED] | | | | | 06:AGGREGATE [FINALIZE] | | | output: sum:merge(j.value) | | | group by: j.name | | | | | 05:EXCHANGE [HASH(j.name)] | | | | | 03:AGGREGATE | | | output: sum(v.value) | | | group by: n.name | | | | | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST] | | | hash predicates: v.id = n.id | | | other predicates: n.name = 'Customer A' | | | | | |--04:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [u_alanj.names n] | | | partitions=1/1 files=2 size=26B | | | predicates: n.name = 'Customer A' | | | | | 00:SCAN HDFS [u_alanj.vals v] | | partitions=1/1 files=9 size=36B | +------------------------------------------------------------------------------------+ Fetched 28 row(s) in 0.03s
The important difference there is that in the working version, "other predicates: n.name = 'Customer A'" shows up in the hash join. I believe that impala isn't pushing the predicates into the join the way it needs to to give proper results in this situation.
The view works consistently in hive.