Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4303

Incorrect change from UNION ALL to UNION DISTINCT after subquery rewriting.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Frontend
    • Labels:

      Description

      For queries that have deeply nested unions and WHERE-clause subqueries, UNION ALL may sometimes be incorrectly changed to a UNION DISTINCT after subquery rewriting. This modification may lead to incorrect results.

      Example:

      select * from functional.alltypestiny
      where exists (select 1 from functional.alltypes where int_col < 10)
      union all
        (select * from functional.alltypestiny where year=2009 and month=1
         union all
         select * from functional.alltypestiny where year=2009 and month=2)
      
      PLAN-ROOT SINK
      |
      12:EXCHANGE [UNPARTITIONED]
      |
      06:UNION
      |
      |--07:SCAN HDFS [functional.alltypestiny]
      |     partitions=1/4 files=1 size=115B
      |
      11:AGGREGATE [FINALIZE]
      |  group by: id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month
      |
      10:EXCHANGE [HASH(id,bool_col,tinyint_col,smallint_col,int_col,bigint_col,float_col,double_col,date_string_col,string_col,timestamp_col,year,month)]
      |
      05:AGGREGATE [STREAMING]
      |  group by: id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month
      |
      00:UNION
      |
      |--04:SCAN HDFS [functional.alltypestiny]
      |     partitions=1/4 files=1 size=115B
      |
      03:NESTED LOOP JOIN [LEFT SEMI JOIN, BROADCAST]
      |
      |--09:EXCHANGE [BROADCAST]
      |  |
      |  08:EXCHANGE [UNPARTITIONED]
      |  |  limit: 1
      |  |
      |  02:SCAN HDFS [functional.alltypes]
      |     partitions=24/24 files=24 size=478.45KB
      |     predicates: int_col < 10
      |     limit: 1
      |
      01:SCAN HDFS [functional.alltypestiny]
         partitions=4/4 files=4 size=460B
      

      Notice the aggregation above which performs the DISTINCT.

      Workaround
      Manually unnest the union as follows to get a correct plan and result:

      select * from functional.alltypestiny
      where exists (select 1 from functional.alltypes where int_col < 10)
      union all
      select * from functional.alltypestiny where year=2009 and month=1
      union all
      select * from functional.alltypestiny where year=2009 and month=2
      

        Activity

        Hide
        alex.behm Alexander Behm added a comment -

        commit 852e272b32a5424e1ea31af7b3d532dac0f916b3
        Author: Alex Behm <alex.behm@cloudera.com>
        Date: Fri Nov 4 17:52:47 2016 -0700

        IMPALA-4303: Do not reset() qualifier of union operands.

        The bug: We used to reset() the qualifier of union operands
        to their original value obtained during parsing. This leads to
        problems when union operands are unnested and we need to rewrite
        Subqueries. In particular, the first union operand of a nested union
        was reset() to a null qualifier, but that operand could be somewhere
        in the middle of the list of unnested operands in the parent. At that
        point, we've lost information about the qualifier of the unnested
        operand.

        The fix: The simplest solution is to not reset() the qualifier.
        The other alternative is be to reset() the qualifier, but also
        undo any unnesting. That seems unnecessary and wasteful.

        Change-Id: I157bb0f08c4a94fd779487d7c23edd64a537a1f6
        Reviewed-on: http://gerrit.cloudera.org:8080/4963
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        alex.behm Alexander Behm added a comment - commit 852e272b32a5424e1ea31af7b3d532dac0f916b3 Author: Alex Behm <alex.behm@cloudera.com> Date: Fri Nov 4 17:52:47 2016 -0700 IMPALA-4303 : Do not reset() qualifier of union operands. The bug: We used to reset() the qualifier of union operands to their original value obtained during parsing. This leads to problems when union operands are unnested and we need to rewrite Subqueries. In particular, the first union operand of a nested union was reset() to a null qualifier, but that operand could be somewhere in the middle of the list of unnested operands in the parent. At that point, we've lost information about the qualifier of the unnested operand. The fix: The simplest solution is to not reset() the qualifier. The other alternative is be to reset() the qualifier, but also undo any unnesting. That seems unnecessary and wasteful. Change-Id: I157bb0f08c4a94fd779487d7c23edd64a537a1f6 Reviewed-on: http://gerrit.cloudera.org:8080/4963 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins

          People

          • Assignee:
            alex.behm Alexander Behm
            Reporter:
            alex.behm Alexander Behm
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development