[SPARK-10169] Evaluating AggregateFunction1 (old code path) may return wrong answers when grouping expressions are used as arguments of aggregate functions - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.1.1, 1.2.2, 1.3.1, 1.4.1
Fix Version/s: 1.3.2, 1.4.2
Component/s: SQL
Labels:
- correctness

Target Version/s:

1.3.2, 1.4.2

Description

Before Spark 1.5, if an aggregate function use an grouping expression as input argument, the result of the query can be wrong. The reason is we are using transformUp when we do aggregate results rewriting (see https://github.com/apache/spark/blob/branch-1.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L154).

To reproduce the problem, you can use

import org.apache.spark.sql.functions._
sc.parallelize((1 to 1000), 50).map(i => Tuple1(i)).toDF("i").registerTempTable("t")
sqlContext.sql(""" 
select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
from t
where i % 10 = 5
group by i % 10""").explain()

== Physical Plan ==
Aggregate false, [PartialGroup#234], [PartialGroup#234 AS _c0#225,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((PartialGroup#234 = 5),1,0), LongType)) AS _c1#226L,Coalesce(SUM(PartialCount#233L),0) AS _c2#227L]
 Exchange (HashPartitioning [PartialGroup#234], 200)
  Aggregate true, [(i#191 % 10)], [(i#191 % 10) AS PartialGroup#234,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf(((i#191 % 10) = 5),1,0), LongType)) AS PartialSum#232L,COUNT(1) AS PartialCount#233L]
   Project [_1#190 AS i#191]
    Filter ((_1#190 % 10) = 5)
     PhysicalRDD [_1#190], MapPartitionsRDD[93] at mapPartitions at ExistingRDD.scala:37

sqlContext.sql(""" 
select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
from t
where i % 10 = 5
group by i % 10""").show

_c0 _c1 _c2
5   50  100

In Spark 1.5, new aggregation code path does not have the problem. The old code path is fixed by https://github.com/apache/spark/commit/dd9ae7945ab65d353ed2b113e0c1a00a0533ffd6.

Attachments

Issue Links

links to

[Github] Pull Request #8379 (yhuai)

[Github] Pull Request #8380 (yhuai)

[Github] Pull Request #8381 (yhuai)

Activity

People

Assignee:: Yin Huai

Reporter:: Yin Huai

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 22/Aug/15 03:48

Updated:: 31/Aug/16 21:20

Resolved:: 24/Aug/15 20:01