Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4303

Incorrect change from UNION ALL to UNION DISTINCT after subquery rewriting.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Frontend
    • Labels:

      Description

      For queries that have deeply nested unions and WHERE-clause subqueries, UNION ALL may sometimes be incorrectly changed to a UNION DISTINCT after subquery rewriting. This modification may lead to incorrect results.

      Example:

      select * from functional.alltypestiny
      where exists (select 1 from functional.alltypes where int_col < 10)
      union all
        (select * from functional.alltypestiny where year=2009 and month=1
         union all
         select * from functional.alltypestiny where year=2009 and month=2)
      
      PLAN-ROOT SINK
      |
      12:EXCHANGE [UNPARTITIONED]
      |
      06:UNION
      |
      |--07:SCAN HDFS [functional.alltypestiny]
      |     partitions=1/4 files=1 size=115B
      |
      11:AGGREGATE [FINALIZE]
      |  group by: id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month
      |
      10:EXCHANGE [HASH(id,bool_col,tinyint_col,smallint_col,int_col,bigint_col,float_col,double_col,date_string_col,string_col,timestamp_col,year,month)]
      |
      05:AGGREGATE [STREAMING]
      |  group by: id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month
      |
      00:UNION
      |
      |--04:SCAN HDFS [functional.alltypestiny]
      |     partitions=1/4 files=1 size=115B
      |
      03:NESTED LOOP JOIN [LEFT SEMI JOIN, BROADCAST]
      |
      |--09:EXCHANGE [BROADCAST]
      |  |
      |  08:EXCHANGE [UNPARTITIONED]
      |  |  limit: 1
      |  |
      |  02:SCAN HDFS [functional.alltypes]
      |     partitions=24/24 files=24 size=478.45KB
      |     predicates: int_col < 10
      |     limit: 1
      |
      01:SCAN HDFS [functional.alltypestiny]
         partitions=4/4 files=4 size=460B
      

      Notice the aggregation above which performs the DISTINCT.

      Workaround
      Manually unnest the union as follows to get a correct plan and result:

      select * from functional.alltypestiny
      where exists (select 1 from functional.alltypes where int_col < 10)
      union all
      select * from functional.alltypestiny where year=2009 and month=1
      union all
      select * from functional.alltypestiny where year=2009 and month=2
      

        Attachments

          Activity

            People

            • Assignee:
              alex.behm Alexander Behm
              Reporter:
              alex.behm Alexander Behm
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: