Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2018

Where clause does not propagate to joins inside nested views

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      We have run into a situation where nested views will cause results to be inconsistent with the results of running the queries in the views directly.

      To reproduce...
      In Impala - Create tables and some data

      > use u_alanj;
      > create table names(id string, name string);
      > insert into names(id, name) values("A", "Customer A");
      > insert into names(id, name) values("B", "Customer B");
      > create table vals(id string, value integer);
      

      In Bash - Create more data

      $ for x in A B C; do for y in 1 2 3; do echo "insert into vals(id, value) values('$x', $y);"; done; done > createData.sql
      $ impala-shell -f createData.sql
      

      In Impala - Create views and demonstrate bad query

      > use u_alanj;
      > create view test_join as select v.id, v.value, n.name from u_alanj.vals v left outer join u_alanj.names n on v.id = n.id;
      > create view test_join2 as select j.name, sum(j.value) as total from test_join j group by j.name;
      > select * from test_join2 where name='Customer A';
      Query: select * from test_join2 where name='Customer A'
      +------------+-------+
      | name       | total |
      +------------+-------+
      | Customer A | 6     |
      | NULL       | 12    |
      +------------+-------+
      Fetched 2 row(s) in 1.30s
      > select j.name, sum(j.value) as total from test_join j where j.name = 'Customer A' group by j.name;
      Query: select j.name, sum(j.value) as total from test_join j where j.name = 'Customer A' group by j.name
      +------------+-------+
      | name       | total |
      +------------+-------+
      | Customer A | 6     |
      +------------+-------+
      Fetched 1 row(s) in 0.15s
      

      In the version using the two views, we get null results.

      Broken Explain Results

      +------------------------------------------------------------------------------------+
      | Explain String                                                                     |
      +------------------------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=2.16GB VCores=2                            |
      | WARNING: The following tables are missing relevant table and/or column statistics. |
      | u_alanj.names, u_alanj.vals                                                        |
      |                                                                                    |
      | 07:EXCHANGE [UNPARTITIONED]                                                        |
      | |                                                                                  |
      | 06:AGGREGATE [FINALIZE]                                                            |
      | |  output: sum:merge(j.value)                                                      |
      | |  group by: j.name                                                                |
      | |                                                                                  |
      | 05:EXCHANGE [HASH(j.name)]                                                         |
      | |                                                                                  |
      | 03:AGGREGATE                                                                       |
      | |  output: sum(v.value)                                                            |
      | |  group by: n.name                                                                |
      | |                                                                                  |
      | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                                          |
      | |  hash predicates: v.id = n.id                                                    |
      | |                                                                                  |
      | |--04:EXCHANGE [BROADCAST]                                                         |
      | |  |                                                                               |
      | |  01:SCAN HDFS [u_alanj.names n]                                                  |
      | |     partitions=1/1 files=2 size=26B                                              |
      | |     predicates: n.name = 'Customer A'                                            |
      | |                                                                                  |
      | 00:SCAN HDFS [u_alanj.vals v]                                                      |
      |    partitions=1/1 files=9 size=36B                                                 |
      +------------------------------------------------------------------------------------+
      Fetched 27 row(s) in 0.05s
      
      

      Working Explain Results

      +------------------------------------------------------------------------------------+
      | Explain String                                                                     |
      +------------------------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=2.16GB VCores=2                            |
      | WARNING: The following tables are missing relevant table and/or column statistics. |
      | u_alanj.names, u_alanj.vals                                                        |
      |                                                                                    |
      | 07:EXCHANGE [UNPARTITIONED]                                                        |
      | |                                                                                  |
      | 06:AGGREGATE [FINALIZE]                                                            |
      | |  output: sum:merge(j.value)                                                      |
      | |  group by: j.name                                                                |
      | |                                                                                  |
      | 05:EXCHANGE [HASH(j.name)]                                                         |
      | |                                                                                  |
      | 03:AGGREGATE                                                                       |
      | |  output: sum(v.value)                                                            |
      | |  group by: n.name                                                                |
      | |                                                                                  |
      | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                                          |
      | |  hash predicates: v.id = n.id                                                    |
      | |  other predicates: n.name = 'Customer A'                                         |
      | |                                                                                  |
      | |--04:EXCHANGE [BROADCAST]                                                         |
      | |  |                                                                               |
      | |  01:SCAN HDFS [u_alanj.names n]                                                  |
      | |     partitions=1/1 files=2 size=26B                                              |
      | |     predicates: n.name = 'Customer A'                                            |
      | |                                                                                  |
      | 00:SCAN HDFS [u_alanj.vals v]                                                      |
      |    partitions=1/1 files=9 size=36B                                                 |
      +------------------------------------------------------------------------------------+
      Fetched 28 row(s) in 0.03s
      

      The important difference there is that in the working version, "other predicates: n.name = 'Customer A'" shows up in the hash join. I believe that impala isn't pushing the predicates into the join the way it needs to to give proper results in this situation.

      The view works consistently in hive.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            dtsirogiannis Dimitris Tsirogiannis
            alanj_impala_5a78 Alan Jackoway
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment