[IMPALA-9281] Inferred predicates not assigned to scan nodes when views are involved - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 3.4.0
Fix Version/s: None
Component/s: Frontend
Labels:
None

Flags:

Important
Epic Color:
ghx-label-11

Description

When a query involves the join of views each created based on multiple tables, the inferred predicate(s) is(are) not assigned to the scan node(s). This issue is/seems related to https://issues.apache.org/jira/browse/IMPALA-4578#.

In the following a minimum example to reproduce the phenomenon.

CREATE TABLE default.pt1 (
   c1 INT,
   c2 STRING
) 
STORED AS PARQUET;
insert into pt1 values (1, 'one');

CREATE TABLE default.pt2 (
   c1 INT,
   c2 STRING
) 
STORED AS PARQUET;
insert into pt2 values (2, 'two');

CREATE TABLE default.pta1 (
   c1a INT, 
   c2a STRING
)
STORED AS PARQUET;
insert into pta1 values (1,'one');

CREATE TABLE default.pta2 (
   c1a INT, 
   c2a STRING
)
STORED AS PARQUET;
insert into pta2 values (2,'two');

CREATE VIEW myview_1_on_2_parquet_tables AS 
SELECT 'ONE' table_source, c1, c2 FROM `default`.pt1 
UNION ALL 
SELECT 'TWO' table_source, c1, c2 FROM `default`.pt2;

CREATE VIEW myview_2_on_2_parquet_tables AS  
SELECT 'ONE' table_source_a, c1a, c2a FROM `default`.pta1 
UNION ALL 
SELECT 'TWO' table_source_a, c1a, c2a FROM `default`.pta2;

For easy reference, the contents of tables pt1, pt2, pta1, pta2, and views myview_1_on_2_tables, myview_2_on_2_tables are also given as follows.

Contents of table pt1 afterwards:

+----+-----+
| c1 | c2  |
+----+-----+
| 1  | one |
+----+-----+

Contents of table pt2 afterwards:

+----+-----+
| c1 | c2  |
+----+-----+
| 2  | two |
+----+-----+

Contents of table pta1 afterwards:

+-----+-----+
| c1a | c2a |
+-----+-----+
| 1   | one |
+-----+-----+

Contents of table pta2 afterwards:

+-----+-----+
| c1a | c2a |
+-----+-----+
| 2   | two |
+-----+-----+

Contents in myview_1_on_2_parquet_tables (union of tables t1 and t2):

+--------------+----+-----+
| table_source | c1 | c2  |
+--------------+----+-----+
| ONE          | 1  | one |
| TWO          | 2  | two |
+--------------+----+-----+

Contents in myview_2_on_2_parquet_tables (union of tables ta1 and ta2):

+----------------+-----+-----+
| table_source_a | c1a | c2a |
+----------------+-----+-----+
| ONE            | 1   | one |
| TWO            | 2   | two |
+----------------+-----+-----+

After creating the related tables and views described above, we consider the following 2 queries.

Query 1:

select * 
from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
where a.table_source = 'ONE' 
and a.table_source = b.table_source_a;

Query 2:

select * 
from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
where a.table_source = 'ONE' 
and b.table_source_a = 'ONE' 
and a.table_source = b.table_source_a;

Both queries join those 2 views on the column table_source and filter out those rows not satisfying table_source = 'ONE'. Both queries produce the same result set as the following.

+--------------+----+-----+----------------+-----+-----+
| table_source | c1 | c2  | table_source_a | c1a | c2a |
+--------------+----+-----+----------------+-----+-----+
| ONE          | 1  | one | ONE            | 1   | one |
+--------------+----+-----+----------------+-----+-----+

However, according to the query profile, Query 1 results in 3 scans on tables pt1, pta1, and pta2, respectively. On the other hand, Query 2 that incorporates the additional/redundant predicate "b.table_source_a = 'ONE'" only involves 2 scans on tables pt1 and pta1, respectively due to this seemingly redundant predicate on b.table_source_a.

Hence, it can be seen that the plan generated from Query 1 is sub-optimal since a table that cannot contain any row in the result set is still scanned, i.e., table pta2.

Inferred predicates not assigned to scan nodes when views are involved

Details

Description

Attachments

Attachments

Activity

People

Dates