Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
Impala 3.4.0
-
None
-
None
-
Important
-
ghx-label-11
Description
When a query involves the join of views each created based on multiple tables, the inferred predicate(s) is(are) not assigned to the scan node(s). This issue is/seems related to https://issues.apache.org/jira/browse/IMPALA-4578#.
In the following a minimum example to reproduce the phenomenon.
CREATE TABLE default.pt1 ( c1 INT, c2 STRING ) STORED AS PARQUET; insert into pt1 values (1, 'one'); CREATE TABLE default.pt2 ( c1 INT, c2 STRING ) STORED AS PARQUET; insert into pt2 values (2, 'two'); CREATE TABLE default.pta1 ( c1a INT, c2a STRING ) STORED AS PARQUET; insert into pta1 values (1,'one'); CREATE TABLE default.pta2 ( c1a INT, c2a STRING ) STORED AS PARQUET; insert into pta2 values (2,'two'); CREATE VIEW myview_1_on_2_parquet_tables AS SELECT 'ONE' table_source, c1, c2 FROM `default`.pt1 UNION ALL SELECT 'TWO' table_source, c1, c2 FROM `default`.pt2; CREATE VIEW myview_2_on_2_parquet_tables AS SELECT 'ONE' table_source_a, c1a, c2a FROM `default`.pta1 UNION ALL SELECT 'TWO' table_source_a, c1a, c2a FROM `default`.pta2;
For easy reference, the contents of tables pt1, pt2, pta1, pta2, and views myview_1_on_2_tables, myview_2_on_2_tables are also given as follows.
Contents of table pt1 afterwards:
+----+-----+ | c1 | c2 | +----+-----+ | 1 | one | +----+-----+
Contents of table pt2 afterwards:
+----+-----+ | c1 | c2 | +----+-----+ | 2 | two | +----+-----+
Contents of table pta1 afterwards:
+-----+-----+ | c1a | c2a | +-----+-----+ | 1 | one | +-----+-----+
Contents of table pta2 afterwards:
+-----+-----+ | c1a | c2a | +-----+-----+ | 2 | two | +-----+-----+
Contents in myview_1_on_2_parquet_tables (union of tables t1 and t2):
+--------------+----+-----+ | table_source | c1 | c2 | +--------------+----+-----+ | ONE | 1 | one | | TWO | 2 | two | +--------------+----+-----+
Contents in myview_2_on_2_parquet_tables (union of tables ta1 and ta2):
+----------------+-----+-----+ | table_source_a | c1a | c2a | +----------------+-----+-----+ | ONE | 1 | one | | TWO | 2 | two | +----------------+-----+-----+
After creating the related tables and views described above, we consider the following 2 queries.
Query 1:
select * from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b where a.table_source = 'ONE' and a.table_source = b.table_source_a;
Query 2:
select * from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b where a.table_source = 'ONE' and b.table_source_a = 'ONE' and a.table_source = b.table_source_a;
Both queries join those 2 views on the column table_source and filter out those rows not satisfying table_source = 'ONE'. Both queries produce the same result set as the following.
+--------------+----+-----+----------------+-----+-----+ | table_source | c1 | c2 | table_source_a | c1a | c2a | +--------------+----+-----+----------------+-----+-----+ | ONE | 1 | one | ONE | 1 | one | +--------------+----+-----+----------------+-----+-----+
However, according to the query profile, Query 1 results in 3 scans on tables pt1, pta1, and pta2, respectively. On the other hand, Query 2 that incorporates the additional/redundant predicate "b.table_source_a = 'ONE'" only involves 2 scans on tables pt1 and pta1, respectively due to this seemingly redundant predicate on b.table_source_a.
Hence, it can be seen that the plan generated from Query 1 is sub-optimal since a table that cannot contain any row in the result set is still scanned, i.e., table pta2.