Details
Bug
Status: Resolved
Critical
Resolution: Fixed
Impala 0.3
None
None
Description
An aggregate over a subquery returns wrong results if the subquery contains a 'limit' clause and the data is distributed across multiple nodes. From the query plan, it looks like the limit is applied to each node's scan separately and the coordinator then just sums the per-node results, so the limit is never enforced on the combined row set.
Example with the data spread across 3 nodes (the expected result is 10, but the query returns 30):
> select count(*) from (select * from tpch.lineitem limit 10) p
Query finished, fetching results ...
30
Returned 1 row(s) in 0.08s
Plan
  UNPARTITIONED
  AGGREGATE
  OUTPUT: SUM(<slot 32>)
  GROUP BY:
  TUPLE IDS: 2
    EXCHANGE (2)
      TUPLE IDS: 2

Plan Fragment 1
  RANDOM
  STREAM DATA SINK
    EXCHANGE ID: 2
    UNPARTITIONED

  AGGREGATE
  OUTPUT: COUNT(*)
  GROUP BY:
  TUPLE IDS: 2
    SCAN HDFS table=tpch.lineitem #partitions=1 size=718.94MB (0)
      LIMIT: 10
      TUPLE IDS: 0
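
The arithmetic behind the wrong answer can be reproduced with a small toy model. This is only a sketch of what the plan appears to do, not Impala's planner or executor code: the function names `buggy_count` and `correct_count`, the three in-memory "nodes", and the row counts are all hypothetical. It assumes each node applies LIMIT 10 to its own scan and computes a local COUNT(*), after which the coordinator sums the partial counts. The second function shows one way the expected result could be obtained (re-applying the limit after the rows are gathered); the report does not say how the bug was actually fixed.

```python
from itertools import islice

LIMIT = 10  # the 'limit 10' from the example subquery


def buggy_count(node_rows, limit=LIMIT):
    """Model of the reported plan: each node limits and counts locally,
    then the coordinator sums the partial counts."""
    partial_counts = [sum(1 for _ in islice(rows, limit)) for rows in node_rows]
    return sum(partial_counts)


def correct_count(node_rows, limit=LIMIT):
    """Model of the expected semantics: nodes may still stop after 'limit'
    rows each, but the limit is applied again to the gathered rows before
    the count is taken."""
    gathered = [row for rows in node_rows for row in islice(rows, limit)]
    return sum(1 for _ in islice(gathered, limit))


# Hypothetical data: three nodes, each holding far more than 10 rows.
nodes = [list(range(1000)) for _ in range(3)]

print(buggy_count(nodes))    # 30, matching the wrong result in the report
print(correct_count(nodes))  # 10, the expected result
```

With three nodes the buggy model returns 3 x 10 = 30, which matches the output shown in the example above.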