Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3126

Incorrect assignment of an inner join On-clause predicate through an outer join.

    Details

      Description

      Impala may return incorrect results for queries that have the following properties:

      • There is an INNER JOIN following a series of OUTER JOINs
      • The INNER JOIN has an On-clause with a predicate that references at least two tables that are on the nullable side of the preceding OUTER JOINs

      Query to repro and its plan:

      select 1 from functional.alltypes a
      left outer join functional.alltypes b
        on a.id = b.id
      left outer join functional.alltypes c
        on b.id = c.id
      right outer join functional.alltypes d
        on c.id = d.id
      inner join functional.alltypes e
        on b.int_col = c.int_col
      
      +-----------------------------------------------------------+
      | Explain String                                            |
      +-----------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=480.04MB VCores=4 |
      |                                                           |
      | 14:EXCHANGE [UNPARTITIONED]                               |
      | |                                                         |
      | 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
      | |                                                         |
      | |--13:EXCHANGE [BROADCAST]                                |
      | |  |                                                      |
      | |  04:SCAN HDFS [functional.alltypes e]                   |
      | |     partitions=24/24 files=24 size=478.45KB             |
      | |                                                         |
      | 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
      | |  hash predicates: c.id = d.id                           |
      | |  runtime filters: RF000 <- d.id                         |
      | |                                                         |
      | |--12:EXCHANGE [HASH(d.id)]                               |
      | |  |                                                      |
      | |  03:SCAN HDFS [functional.alltypes d]                   |
      | |     partitions=24/24 files=24 size=478.45KB             |
      | |                                                         |
      | 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]               |
      | |  hash predicates: b.id = c.id                           |
      | |  other predicates: b.int_col = c.int_col     <--- incorrect placement; should be at node 07 or 08           |
      | |  runtime filters: RF001 <- c.int_col                    |
      | |                                                         |
      | |--11:EXCHANGE [HASH(c.id)]                               |
      | |  |                                                      |
      | |  02:SCAN HDFS [functional.alltypes c]                   |
      | |     partitions=24/24 files=24 size=478.45KB             |
      | |     runtime filters: RF000 -> c.id                      |
      | |                                                         |
      | 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
      | |  hash predicates: b.id = a.id                           |
      | |  runtime filters: RF002 <- a.id                         |
      | |                                                         |
      | |--10:EXCHANGE [HASH(a.id)]                               |
      | |  |                                                      |
      | |  00:SCAN HDFS [functional.alltypes a]                   |
      | |     partitions=24/24 files=24 size=478.45KB             |
      | |                                                         |
      | 09:EXCHANGE [HASH(b.id)]                                  |
      | |                                                         |
      | 01:SCAN HDFS [functional.alltypes b]                      |
      |    partitions=24/24 files=24 size=478.45KB                |
      |    runtime filters: RF001 -> b.int_col, RF002 -> b.id     |
      +-----------------------------------------------------------+
      

      Workaround
      For some queries, this problem can be worked around by placing the problematic On-clause predicate in the WHERE clause instead, or changing the preceding OUTER JOINS to INNER JOINS (if the On-clause predicate would discard NULLs).

      To fix the example query above:

      select 1 from functional.alltypes a
      left outer join functional.alltypes b
        on a.id = b.id
      left outer join functional.alltypes c
        on b.id = c.id
      right outer join functional.alltypes d
        on c.id = d.id
      inner join functional.alltypes e
      where b.int_col = c.int_col
      
      +-----------------------------------------------------------+
      | Explain String                                            |
      +-----------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=480.04MB VCores=4 |
      |                                                           |
      | 14:EXCHANGE [UNPARTITIONED]                               |
      | |                                                         |
      | 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
      | |                                                         |
      | |--13:EXCHANGE [BROADCAST]                                |
      | |  |                                                      |
      | |  04:SCAN HDFS [functional.alltypes e]                   |
      | |     partitions=24/24 files=24 size=478.45KB             |
      | |                                                         |
      | 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
      | |  hash predicates: c.id = d.id                           |
      | |  other predicates: b.int_col = c.int_col          <-- correct assignment      |
      | |  runtime filters: RF000 <- d.id                         |
      | |                                                         |
      | |--12:EXCHANGE [HASH(d.id)]                               |
      | |  |                                                      |
      | |  03:SCAN HDFS [functional.alltypes d]                   |
      | |     partitions=24/24 files=24 size=478.45KB             |
      | |                                                         |
      | 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]               |
      | |  hash predicates: b.id = c.id                           |
      | |                                                         |
      | |--11:EXCHANGE [HASH(c.id)]                               |
      | |  |                                                      |
      | |  02:SCAN HDFS [functional.alltypes c]                   |
      | |     partitions=24/24 files=24 size=478.45KB             |
      | |     runtime filters: RF000 -> c.id                      |
      | |                                                         |
      | 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
      | |  hash predicates: b.id = a.id                           |
      | |  runtime filters: RF001 <- a.id                         |
      | |                                                         |
      | |--10:EXCHANGE [HASH(a.id)]                               |
      | |  |                                                      |
      | |  00:SCAN HDFS [functional.alltypes a]                   |
      | |     partitions=24/24 files=24 size=478.45KB             |
      | |                                                         |
      | 09:EXCHANGE [HASH(b.id)]                                  |
      | |                                                         |
      | 01:SCAN HDFS [functional.alltypes b]                      |
      |    partitions=24/24 files=24 size=478.45KB                |
      |    runtime filters: RF001 -> b.id                         |
      +-----------------------------------------------------------+
      

        Activity

        Hide
        alex.behm Alexander Behm added a comment -

        Not tracking this correctness issue as a blocker due to the low estimated likelihood of hitting it. Also, this issue is not a regression from a previous release and it has documented workarounds.

        Show
        alex.behm Alexander Behm added a comment - Not tracking this correctness issue as a blocker due to the low estimated likelihood of hitting it. Also, this issue is not a regression from a previous release and it has documented workarounds.
        Hide
        jrussell John Russell added a comment -

        Took all of the issue/repro/workaround text basically verbatim.

        Blanked out "doc text" field so this issue doesn't show up on my to-do list.

        Show
        jrussell John Russell added a comment - Took all of the issue/repro/workaround text basically verbatim. Blanked out "doc text" field so this issue doesn't show up on my to-do list.
        Hide
        alex.behm Alexander Behm added a comment -

        commit 80f85179f99ff36d6ecad65b2041b45015ffb716
        Author: Alex Behm <alex.behm@cloudera.com>
        Date: Mon Nov 7 14:15:45 2016 -0800

        IMPALA-3126: Conservative assignment of inner-join On-clause predicates.

        Implements the following conservative but correct policy for assigning
        predicates from the On-clause of an inner join:
        If the predicate references an outer-joined tuple, then evaluate it at
        the inner join that the On-clause belongs to.

        Cleans up Analyzer.canEvalPredicate().

        Change-Id: Idf45323ed9102ffb45c9d94a130ea3692286f215
        Reviewed-on: http://gerrit.cloudera.org:8080/4982
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        alex.behm Alexander Behm added a comment - commit 80f85179f99ff36d6ecad65b2041b45015ffb716 Author: Alex Behm <alex.behm@cloudera.com> Date: Mon Nov 7 14:15:45 2016 -0800 IMPALA-3126 : Conservative assignment of inner-join On-clause predicates. Implements the following conservative but correct policy for assigning predicates from the On-clause of an inner join: If the predicate references an outer-joined tuple, then evaluate it at the inner join that the On-clause belongs to. Cleans up Analyzer.canEvalPredicate(). Change-Id: Idf45323ed9102ffb45c9d94a130ea3692286f215 Reviewed-on: http://gerrit.cloudera.org:8080/4982 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins

          People

          • Assignee:
            alex.behm Alexander Behm
            Reporter:
            alex.behm Alexander Behm
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development