Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10755

Wrong results for a query with predicate on an analytic function

    XMLWordPrintableJSON

Details

    • ghx-label-12

    Description

      In the master branch, the following query with equality predicate on analytic function produces 13 rows whereas the correct result is 3 rows (as verified with Hive):

      use tpch;
      
      select *
        FROM (
          SELECT
            l_quantity, l_linenumber,
            nr_pvp,
            max(nr_pvp) over (partition BY l_quantity) max_nr_pvp
          FROM (
            SELECT
              l_quantity, l_linenumber,
              count(1) AS nr_pvp
            FROM
              lineitem, partsupp
              where l_partkey	= ps_partkey
              and ps_suppkey = 10 and  l_quantity in (1, 2)
              GROUP BY l_quantity, l_linenumber
          ) a
        ) b
        WHERE nr_pvp = max_nr_pvp
        order by 1, 2, 3, 4;
      

      Wrong results:

      +------------+--------------+--------+------------+                                                      
      | l_quantity | l_linenumber | nr_pvp | max_nr_pvp |
      +------------+--------------+--------+------------+
      | 1.00       | 1            | 11     | 12         |
      | 1.00       | 2            | 12     | 12         |
      | 1.00       | 3            | 10     | 12         |
      | 1.00       | 4            | 5      | 12         |
      | 1.00       | 5            | 4      | 12         |
      | 1.00       | 6            | 5      | 12         |
      | 1.00       | 7            | 2      | 12         |
      | 2.00       | 1            | 12     | 12         |
      | 2.00       | 2            | 12     | 12         |
      | 2.00       | 3            | 11     | 12         |
      | 2.00       | 4            | 8      | 12         |
      | 2.00       | 5            | 3      | 12         |
      | 2.00       | 6            | 2      | 12         |
      +------------+--------------+--------+------------+
      Fetched 13 row(s) in 1.36s
      

      Expected results (verified with Hive):

      +---------------+-----------------+-----------+---------------+
      | b.l_quantity  | b.l_linenumber  | b.nr_pvp  | b.max_nr_pvp  |
      +---------------+-----------------+-----------+---------------+
      | 1.00          | 2               | 12        | 12            |
      | 2.00          | 1               | 12        | 12            |
      | 2.00          | 2               | 12        | 12            |
      +---------------+-----------------+-----------+---------------+
      

      Explain plan for Impala shows that the equality predicate 'nr_pvp = max_nr_pvp' which is supposed to be applied on top of the analytic operator is missing:

      +---------------------------------------------------------------------------+
      | Explain String                                                            |
      +---------------------------------------------------------------------------+
      | Max Per-Host Resource Reservation: Memory=42.88MB Threads=7               |
      | Per-Host Resource Estimates: Memory=483MB                                 |
      |                                                                           |
      | PLAN-ROOT SINK                                                            |
      | |                                                                         |
      | 11:MERGING-EXCHANGE [UNPARTITIONED]                                       |
      | |  order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC |
      | |                                                                         |
      | 06:SORT                                                                   |
      | |  order by: l_quantity ASC, l_linenumber ASC, nr_pvp ASC, max_nr_pvp ASC |
      | |  row-size=28B cardinality=357                                           |
      | |                                                                         |
      | 05:ANALYTIC                                                               |
      | |  functions: max(count(*))                                               |
      | |  partition by: l_quantity                                               |
      | |  row-size=28B cardinality=357                                           |
      | |                                                                         |
      | 04:SORT                                                                   |
      | |  order by: l_quantity ASC NULLS LAST                                    |
      | |  row-size=20B cardinality=357                                           |
      | |                                                                         |
      | 10:EXCHANGE [HASH(l_quantity)]                                            |
      | |                                                                         |
      | 09:AGGREGATE [FINALIZE]                                                   |
      | |  output: count:merge(*)                                                 |
      | |  group by: l_quantity, l_linenumber                                     |
      | |  row-size=20B cardinality=357                                           |
      | |                                                                         |
      | 08:EXCHANGE [HASH(l_quantity,l_linenumber)]                               |
      | |                                                                         |
      | 03:AGGREGATE [STREAMING]                                                  |
      | |  output: count(*)                                                       |
      | |  group by: l_quantity, l_linenumber                                     |
      | |  row-size=20B cardinality=357                                           |
      | |                                                                         |
      | 02:HASH JOIN [INNER JOIN, BROADCAST]                                      |
      | |  hash predicates: l_partkey = ps_partkey                                |
      | |  runtime filters: RF000 <- ps_partkey                                   |
      | |  row-size=36B cardinality=2.45K                                         |
      | |                                                                         |
      | |--07:EXCHANGE [BROADCAST]                                                |
      | |  |                                                                      |
      | |  01:SCAN HDFS [tpch.partsupp]                                           |
      | |     HDFS partitions=1/1 files=1 size=112.71MB                           |
      | |     predicates: ps_suppkey = 10                                         |
      | |     row-size=16B cardinality=82                                         |
      | |                                                                         |
      | 00:SCAN HDFS [tpch.lineitem]                                              |
      |    HDFS partitions=1/1 files=1 size=718.94MB                              |
      |    predicates: l_quantity IN (1, 2)                                       |
      |    runtime filters: RF000 -> l_partkey                                    |
      |    row-size=20B cardinality=235.34K                                       |
      +---------------------------------------------------------------------------+
      

      This is likely a regression since based on some internal testing this was working correctly on an earlier version.

      Attachments

        Activity

          People

            amansinha Aman Sinha
            amansinha Aman Sinha
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: