Description
Spark SQL is unable to generate a correct result when the following query using rollup.
"select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, b with rollup"
Spark SQL generates a wrong result:
[2,4,6,3]
[2,null,null,1]
[1,null,null,1]
[null,null,null,0]
[1,2,3,3]
The table mytable is super simple, containing two rows and two columns:
testData = Seq((1, 2), (2, 4)).toDF("a", "b")
After turning off codegen, the query plan is like
== Parsed Logical Plan ==
'Rollup ['a,'b], unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS sumAB#20),unresolvedalias('GROUPING__ID)
'UnresolvedRelation `mytable`, None
== Analyzed Logical Plan ==
a: int, b: int, sumAB: bigint, GROUPING__ID: int
Aggregate a#2,b#3,grouping__id#23, a#2,b#3,sum(cast((a#2 + b#3) as bigint)) AS sumAB#20L,GROUPING__ID#23
Expand [0,1,3], a#2,b#3, grouping__id#23
Subquery mytable
Project _1#0 AS a#2,_2#1 AS b#3
LocalRelation _1#0,_2#1, [[1,2],[2,4]]
== Optimized Logical Plan ==
Aggregate a#2,b#3,grouping__id#23, a#2,b#3,sum(cast((a#2 + b#3) as bigint)) AS sumAB#20L,GROUPING__ID#23
Expand [0,1,3], a#2,b#3, grouping__id#23
LocalRelation a#2,b#3, [[1,2],[2,4]]
== Physical Plan ==
Aggregate false, a#2,b#3,grouping__id#23, a#2,b#3,sum(PartialSum#24L) AS sumAB#20L,grouping__id#23
Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
Aggregate true, a#2,b#3,grouping__id#23, a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L
Expand List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3), a#2,b#3,grouping__id#23
LocalTableScan a#2,b#3, [[1,2],[2,4]]
Below are my observations:
1. Generation of GROUP__ID looks OK.
2. The problem still exists no matter whether turning on/off CODEGEN
3. Rollup still works in a simple query when group-by columns have only one column. For example, "select b, sum(a), GROUPING__ID from mytable group by b with rollup"
4. The buckets in "HiveDataFrameAnalytcisSuite" are misleading. Unfortunately, they hide the bugs. Although the buckets passed, they just compare the results of SQL and Dataframe. This way is unable to capture the regression when both return the same wrong results.
5. The same problem also exists in cube. I have not started the investigation in cube, but I believe the root causes should be the same.
6. It looks like all the logical plans are correct.