Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2093

Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate

    Details

      Description

      Problem
      IN subqueries may return wrong results if the left-hand side of the IN is a constant. For example,

      select * from alltypestiny t1
      where 10 not in (select sum(int_col) from alltypestiny)
      

      Impala generates a wrong plan (see below):

      +----------------------------------------------------------+
      | Explain String                                           |
      +----------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
      |                                                          |
      | 07:EXCHANGE [UNPARTITIONED]                              |
      | |                                                        |
      | 03:CROSS JOIN [BROADCAST]                                |
      | |                                                        |
      | |--06:EXCHANGE [BROADCAST]                               |
      | |  |                                                     |
      | |  05:AGGREGATE [FINALIZE]                               |
      | |  |  output: sum:merge(int_col)                         |
      | |  |  *having: 10 = sum(int_col) *                        |
      | |  |                                                     |
      | |  04:EXCHANGE [UNPARTITIONED]                           |
      | |  |                                                     |
      | |  02:AGGREGATE                                          |
      | |  |  output: sum(int_col)                               |
      | |  |                                                     |
      | |  01:SCAN HDFS [functional.alltypestiny t1]             |
      | |     partitions=4/4 files=4 size=460B                   |
      | |                                                        |
      | 00:SCAN HDFS [functional.alltypestiny t1]                |
      |    partitions=4/4 files=4 size=460B                      |
      +----------------------------------------------------------+
      

      Workaround
      As a workaround, wrap the constant into an inline view and reference the view column in the IN subquery. To fix the example above:

      select * from alltypestiny t1, (select 10 tmp) v
      where tmp not in (select sum(int_col) from alltypestiny);
      

      Query Gen Details

      Impala returns 0 rows; Postgres returns 8 rows.

      Impala Query:

      SELECT
      COALESCE(t1.year, t1.id) AS int_col,
      COALESCE(t1.tinyint_col, t1.tinyint_col) AS int_col_1,
      IF(t1.bool_col, t1.float_col, t1.float_col) AS float_col
      FROM alltypestiny t1
      WHERE
      (365.06) NOT IN (SELECT
      COALESCE(77, -188, COALESCE(-87, SUM(t1.id), -151)) AS int_col
      FROM alltypestiny t1
      WHERE
      (t1.tinyint_col) IN (t1.month, -306.35)
      HAVING
      (MIN(t1.int_col)) >= (606))
      

      Equivalent Postgres Query:

      SELECT
      COALESCE(t1.year, t1.id) AS int_col,
      COALESCE(t1.tinyint_col, t1.tinyint_col) AS int_col_1,
      CASE WHEN t1.bool_col THEN t1.float_col ELSE t1.float_col END AS float_col
      FROM alltypestiny t1
      WHERE
      (365.06) NOT IN (SELECT
      COALESCE(77, -188, COALESCE(-87, SUM(t1.id), -151)) AS int_col
      FROM alltypestiny t1
      WHERE
      (t1.tinyint_col) IN (t1.month, -306.35)
      HAVING
      (MIN(t1.int_col)) >= (606))
      

        Attachments

          Activity

            People

            • Assignee:
              dtsirogiannis Dimitris Tsirogiannis
              Reporter:
              tarasbob Taras Bobrovytsky
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: