[DRILL-2242] Wrong result (more rows) when outer query groups by subset of columns that inner query groups by - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.7.0
Fix Version/s: None
Component/s: Query Planning & Optimization
Labels:
None

Description

The following query has a subquery that groups on 2 columns and outer query that queries on 1 of those columns. With slice_target = 1 to force exchanges, it produces incorrect result:

alter session set `planner.slice_target` = 1;
select count(*) from 
 (select l_partksy from
   (select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet`
      group by l_partkey, l_suppkey) 
   group by l_partkey
 );

+------------+
|   EXPR$0   |
+------------+
| 6227       |
+------------+
1 row selected (1.522 seconds)

Correct result (from Postgres):

 count
-------
  2000
(1 row)

The cause appears to be related to distribution trait propagation. Here's the EXPLAIN plan:

+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
00-02        UnionExchange
01-01          StreamAgg(group=[{}], EXPR$0=[COUNT()])
01-02            Project($f0=[0])
01-03              HashAgg(group=[{0}])
01-04                Project(l_partkey=[$0])
01-05                  HashAgg(group=[{0, 1}])
01-06                    HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
02-01                      HashAgg(group=[{0, 1}])
02-02                        Project(l_partkey=[$1], l_suppkey=[$0])
02-03                          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/lineitem.parquet]], selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`, `l_suppkey`]]])

Note that the HashExchange operator 06 does a distribute on 2 columns l_partkey and l_suppkey in order to perform the 2phase aggregation. These are the group-by columns. However, in the outer query's HashAgg, there is no re-distribution being done. It assumes that data is already hash distributed on l_partkey which is not correct.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

0002-DRILL-2242-Propagate-distribution-trait-when-Project.patch
19/Feb/15 07:43
3 kB
Jinfeng Ni

Activity

People

Assignee:: Jinfeng Ni

Reporter:: Aman Sinha

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 13/Feb/15 03:13

Updated:: 24/Feb/15 23:22

Resolved:: 24/Feb/15 23:22