IMPALA / IMPALA-20

Aggregate of a subquery result set returns wrong results if the subquery contains a 'limit' and data is distributed across multiple nodes


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 0.3
    • Fix Version/s: Impala 0.7
    • Component/s: None
    • Labels: None

    Description

      An aggregate over a subquery result set returns wrong results if the subquery contains a 'limit' clause and the data is distributed across multiple nodes. From the query plan, it looks like the limit is applied inside each node's scan, each node computes its own count, and we are just summing the results from each slave.

      For example, if the data is spread across 3 nodes (the expected result is 10):

      > select count(*) from (select * from tpch.lineitem limit 10) p
      Query finished, fetching results ...
      30
      Returned 1 row(s) in 0.08s
      

      Plan

       
        UNPARTITIONED
        AGGREGATE
        OUTPUT: SUM(<slot 32>)
        GROUP BY:
        TUPLE IDS: 2
          EXCHANGE (2)
            TUPLE IDS: 2
      
      Plan Fragment 1
        RANDOM
        STREAM DATA SINK
          EXCHANGE ID: 2
          UNPARTITIONED
      
        AGGREGATE
        OUTPUT: COUNT(*)
        GROUP BY:
        TUPLE IDS: 2
          SCAN HDFS table=tpch.lineitem #partitions=1 size=718.94MB (0)
            LIMIT: 10
            TUPLE IDS: 0
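
The behaviour suggested by the plan above can be reproduced with a small simulation. This is only a sketch, not Impala code: the function names (`scan_with_limit`, `buggy_plan`, `expected_plan`) and the in-memory "nodes" are hypothetical, and the "expected" variant shows one way to get the intended semantics (re-applying the limit after the per-node streams are merged), not necessarily how the issue was fixed.

```python
from typing import List

LIMIT = 10  # the subquery's 'limit 10'


def scan_with_limit(rows: List[int], limit: int) -> List[int]:
    """Per-node scan that stops after `limit` rows (SCAN HDFS with LIMIT: 10)."""
    return rows[:limit]


def buggy_plan(nodes: List[List[int]], limit: int) -> int:
    """Mirror the reported plan: COUNT(*) runs on each node over its own
    limited scan, and the coordinator SUMs the partial counts."""
    partial_counts = [len(scan_with_limit(rows, limit)) for rows in nodes]
    return sum(partial_counts)


def expected_plan(nodes: List[List[int]], limit: int) -> int:
    """Assumed correct semantics: merge the per-node streams, apply the
    limit once more on the merged stream, then count."""
    merged = [row for rows in nodes for row in scan_with_limit(rows, limit)]
    return len(merged[:limit])


# Three nodes, each holding far more than 10 rows of the table.
nodes = [list(range(1000)) for _ in range(3)]
buggy = buggy_plan(nodes, LIMIT)
expected = expected_plan(nodes, LIMIT)
print(buggy, expected)  # 30 10
```

With three nodes the per-node limit lets through 10 rows on each node, so the summed count is 30, matching the wrong result reported above. Counting only after a global limit gives 10.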
      

          People

            marcelk Marcel Kinard
            lskuff Lenni Kuff