Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-9281

Inferred predicates not assigned to scan nodes when views are involved

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.4.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels:
      None
    • Flags:
      Important
    • Epic Color:
      ghx-label-11

      Description

      When a query involves the join of views each created based on multiple tables, the inferred predicate(s) is(are) not assigned to the scan node(s). This issue is/seems related to https://issues.apache.org/jira/browse/IMPALA-4578#.

      In the following a minimum example to reproduce the phenomenon.

      CREATE TABLE default.pt1 (
         c1 INT,
         c2 STRING
      ) 
      STORED AS PARQUET;
      insert into pt1 values (1, 'one');
      
      CREATE TABLE default.pt2 (
         c1 INT,
         c2 STRING
      ) 
      STORED AS PARQUET;
      insert into pt2 values (2, 'two');
      
      CREATE TABLE default.pta1 (
         c1a INT, 
         c2a STRING
      )
      STORED AS PARQUET;
      insert into pta1 values (1,'one');
      
      CREATE TABLE default.pta2 (
         c1a INT, 
         c2a STRING
      )
      STORED AS PARQUET;
      insert into pta2 values (2,'two');
      
      CREATE VIEW myview_1_on_2_parquet_tables AS 
      SELECT 'ONE' table_source, c1, c2 FROM `default`.pt1 
      UNION ALL 
      SELECT 'TWO' table_source, c1, c2 FROM `default`.pt2;
      
      CREATE VIEW myview_2_on_2_parquet_tables AS  
      SELECT 'ONE' table_source_a, c1a, c2a FROM `default`.pta1 
      UNION ALL 
      SELECT 'TWO' table_source_a, c1a, c2a FROM `default`.pta2;
      

      For easy reference, the contents of tables pt1, pt2, pta1, pta2, and views myview_1_on_2_tables, myview_2_on_2_tables are also given as follows.

      Contents of table pt1 afterwards:

      +----+-----+
      | c1 | c2  |
      +----+-----+
      | 1  | one |
      +----+-----+
      

      Contents of table pt2 afterwards:

      +----+-----+
      | c1 | c2  |
      +----+-----+
      | 2  | two |
      +----+-----+
      

      Contents of table pta1 afterwards:

      +-----+-----+
      | c1a | c2a |
      +-----+-----+
      | 1   | one |
      +-----+-----+
      

      Contents of table pta2 afterwards:

      +-----+-----+
      | c1a | c2a |
      +-----+-----+
      | 2   | two |
      +-----+-----+
      

      Contents in myview_1_on_2_parquet_tables (union of tables t1 and t2):

      +--------------+----+-----+
      | table_source | c1 | c2  |
      +--------------+----+-----+
      | ONE          | 1  | one |
      | TWO          | 2  | two |
      +--------------+----+-----+
      

      Contents in myview_2_on_2_parquet_tables (union of tables ta1 and ta2):

      +----------------+-----+-----+
      | table_source_a | c1a | c2a |
      +----------------+-----+-----+
      | ONE            | 1   | one |
      | TWO            | 2   | two |
      +----------------+-----+-----+
      

      After creating the related tables and views described above, we consider the following 2 queries.

      Query 1:

      select * 
      from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
      where a.table_source = 'ONE' 
      and a.table_source = b.table_source_a;
      

      Query 2:

      select * 
      from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
      where a.table_source = 'ONE' 
      and b.table_source_a = 'ONE' 
      and a.table_source = b.table_source_a;
      

      Both queries join those 2 views on the column table_source and filter out those rows not satisfying table_source = 'ONE'. Both queries produce the same result set as the following.

      +--------------+----+-----+----------------+-----+-----+
      | table_source | c1 | c2  | table_source_a | c1a | c2a |
      +--------------+----+-----+----------------+-----+-----+
      | ONE          | 1  | one | ONE            | 1   | one |
      +--------------+----+-----+----------------+-----+-----+
      

      However, according to the query profile, Query 1 results in 3 scans on tables pt1, pta1, and pta2, respectively. On the other hand, Query 2 that incorporates the additional/redundant predicate "b.table_source_a = 'ONE'" only involves 2 scans on tables pt1 and pta1, respectively due to this seemingly redundant predicate on b.table_source_a.

      Hence, it can be seen that the plan generated from Query 1 is sub-optimal since a table that cannot contain any row in the result set is still scanned, i.e., table pta2.

        Attachments

        1. profile_query_1_parquet.txt
          62 kB
          Fang-Yu Rao
        2. profile_query_2_parquet.txt
          54 kB
          Fang-Yu Rao

          Activity

            People

            • Assignee:
              fangyurao Fang-Yu Rao
              Reporter:
              fangyurao Fang-Yu Rao
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated: