[SPARK-11275] [SQL] Incorrect results when using rollup/cube - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.3.0, 1.4.0, 1.5.1
Fix Version/s: 1.6.0
Component/s: SQL
Labels:
None

Target Version/s:

1.6.0

Description

Spark SQL is unable to generate a correct result when the following query using rollup.
"select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, b with rollup"

Spark SQL generates a wrong result:
[2,4,6,3]
[2,null,null,1]
[1,null,null,1]
[null,null,null,0]
[1,2,3,3]

The table mytable is super simple, containing two rows and two columns:
testData = Seq((1, 2), (2, 4)).toDF("a", "b")

After turning off codegen, the query plan is like

== Parsed Logical Plan ==
'Rollup ['a,'b], unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS sumAB#20),unresolvedalias('GROUPING__ID)
'UnresolvedRelation `mytable`, None

== Analyzed Logical Plan ==
a: int, b: int, sumAB: bigint, GROUPING__ID: int
Aggregate a#2,b#3,grouping__id#23, a#2,b#3,sum(cast((a#2 + b#3) as bigint)) AS sumAB#20L,GROUPING__ID#23
Expand [0,1,3], a#2,b#3, grouping__id#23
Subquery mytable
Project _1#0 AS a#2,_2#1 AS b#3
LocalRelation _1#0,_2#1, [[1,2],[2,4]]

== Optimized Logical Plan ==
Aggregate a#2,b#3,grouping__id#23, a#2,b#3,sum(cast((a#2 + b#3) as bigint)) AS sumAB#20L,GROUPING__ID#23
Expand [0,1,3], a#2,b#3, grouping__id#23
LocalRelation a#2,b#3, [[1,2],[2,4]]

== Physical Plan ==
Aggregate false, a#2,b#3,grouping__id#23, a#2,b#3,sum(PartialSum#24L) AS sumAB#20L,grouping__id#23
Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
Aggregate true, a#2,b#3,grouping__id#23, a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L
Expand List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3), a#2,b#3,grouping__id#23
LocalTableScan a#2,b#3, [[1,2],[2,4]]

Below are my observations:

1. Generation of GROUP__ID looks OK.
2. The problem still exists no matter whether turning on/off CODEGEN
3. Rollup still works in a simple query when group-by columns have only one column. For example, "select b, sum(a), GROUPING__ID from mytable group by b with rollup"
4. The buckets in "HiveDataFrameAnalytcisSuite" are misleading. Unfortunately, they hide the bugs. Although the buckets passed, they just compare the results of SQL and Dataframe. This way is unable to capture the regression when both return the same wrong results.
5. The same problem also exists in cube. I have not started the investigation in cube, but I believe the root causes should be the same.
6. It looks like all the logical plans are correct.

Attachments

Issue Links

links to

[Github] Pull Request #9419 (gatorsmile)

[Github] Pull Request #9429 (aray)

[Github] Pull Request #9815 (aray)

Activity

People

Assignee:: Andrew Ray

Reporter:: Xiao Li

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 23/Oct/15 06:11

Updated:: 19/Nov/15 23:12

Resolved:: 19/Nov/15 23:11