SPARK-10169

Evaluating AggregateFunction1 (old code path) may return wrong answers when grouping expressions are used as arguments of aggregate functions


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1.1, 1.2.2, 1.3.1, 1.4.1
    • Fix Version/s: 1.3.2, 1.4.2
    • Component/s: SQL

      Description

      Before Spark 1.5, if an aggregate function uses a grouping expression as an input argument, the result of the query can be wrong. The reason is that we use transformUp when we rewrite aggregate results (see https://github.com/apache/spark/blob/branch-1.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L154).
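
      The cause is easiest to see on a toy expression tree. The sketch below uses hypothetical classes (Attr, Mod, Sum, etc.), not Catalyst's real API, but the two traversal orders mirror bottom-up vs. top-down rewriting: bottom-up, the grouping expression inside the aggregate's argument is rewritten first, so the mutated SUM(...) no longer matches any entry in the partial-result map and ends up being re-evaluated by the final aggregate, while a top-down rewrite matches the whole aggregate expression before its children are touched.

      // Hypothetical expression classes for illustration only; not Catalyst's API.
      sealed trait Expr {
        def mapChildren(f: Expr => Expr): Expr = this match {
          case Sum(c)      => Sum(f(c))
          case If(p, t, e) => If(f(p), f(t), f(e))
          case Eq(l, r)    => Eq(f(l), f(r))
          case Mod(l, r)   => Mod(f(l), f(r))
          case leaf        => leaf
        }
        // Rewrite children first, then try the rule on the mutated node.
        def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
          val afterChildren = mapChildren(_.transformUp(rule))
          rule.applyOrElse(afterChildren, identity[Expr])
        }
        // Try the rule on the node first, then rewrite its children.
        def transformDown(rule: PartialFunction[Expr, Expr]): Expr = {
          val afterSelf = rule.applyOrElse(this, identity[Expr])
          afterSelf.mapChildren(_.transformDown(rule))
        }
      }
      case class Attr(name: String)            extends Expr
      case class Lit(v: Int)                   extends Expr
      case class Mod(l: Expr, r: Expr)         extends Expr
      case class Eq(l: Expr, r: Expr)          extends Expr
      case class If(p: Expr, t: Expr, e: Expr) extends Expr
      case class Sum(c: Expr)                  extends Expr

      object RewriteDemo extends App {
        val groupExpr = Mod(Attr("i"), Lit(10))                        // i % 10
        val sumExpr   = Sum(If(Eq(groupExpr, Lit(5)), Lit(1), Lit(0))) // sum(if(i % 10 = 5, 1, 0))

        // Expressions already computed by the partial aggregate, mapped to
        // the attributes that carry their results.
        val resultMap: Map[Expr, Expr] = Map(
          groupExpr -> Attr("PartialGroup"),
          sumExpr   -> Attr("PartialSum"))
        val rule: PartialFunction[Expr, Expr] = {
          case e if resultMap.contains(e) => resultMap(e)
        }

        // Bottom-up: i % 10 is replaced inside the SUM first, so the mutated
        // SUM never matches resultMap and the final aggregate recomputes it.
        println(sumExpr.transformUp(rule))
        // => Sum(If(Eq(Attr(PartialGroup),Lit(5)),Lit(1),Lit(0)))

        // Top-down: the whole SUM matches first and is replaced correctly.
        println(sumExpr.transformDown(rule))
        // => Attr(PartialSum)
      }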

      To reproduce the problem, you can use:

      import sqlContext.implicits._ // for toDF
      // 1000 rows spread over 50 partitions; 100 of them satisfy i % 10 = 5
      sc.parallelize((1 to 1000), 50).map(i => Tuple1(i)).toDF("i").registerTempTable("t")
      sqlContext.sql(""" 
      select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
      from t
      where i % 10 = 5
      group by i % 10""").explain()
      
      == Physical Plan ==
      Aggregate false, [PartialGroup#234], [PartialGroup#234 AS _c0#225,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((PartialGroup#234 = 5),1,0), LongType)) AS _c1#226L,Coalesce(SUM(PartialCount#233L),0) AS _c2#227L]
       Exchange (HashPartitioning [PartialGroup#234], 200)
        Aggregate true, [(i#191 % 10)], [(i#191 % 10) AS PartialGroup#234,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf(((i#191 % 10) = 5),1,0), LongType)) AS PartialSum#232L,COUNT(1) AS PartialCount#233L]
         Project [_1#190 AS i#191]
          Filter ((_1#190 % 10) = 5)
           PhysicalRDD [_1#190], MapPartitionsRDD[93] at mapPartitions at ExistingRDD.scala:37
      
      sqlContext.sql(""" 
      select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
      from t
      where i % 10 = 5
      group by i % 10""").show
      
      _c0 _c1 _c2
      5   50  100

      The correct value of _c1 is 100 (all 100 rows in the group satisfy i % 10 = 5), matching _c2. It comes back as 50 because, as the plan above shows, the final Aggregate re-evaluates SUM(if(PartialGroup#234 = 5, 1, 0)) over the partial rows (one per partition, 50 in total) instead of combining PartialSum#232L, which it never references.
      

      In Spark 1.5, the new aggregation code path does not have this problem. The old code path is fixed by https://github.com/apache/spark/commit/dd9ae7945ab65d353ed2b113e0c1a00a0533ffd6.


            People

            • Assignee: Yin Huai
            • Reporter: Yin Huai
            • Votes: 0
            • Watchers: 3
