Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-1915

AverageFunction should not count if the evaluated value is null.

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.1, 1.1.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Average values are difference between the calculation is done partially or not partially.

      Because AverageFunction (in not-partially calculation) counts even if the evaluated value is null.

      To reproduce this bug, run the following in sbt/sbt hive/console:

      scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
      ...
      == Query Plan ==
      Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
       Exchange SinglePartition
        Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
         HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable
      14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
       Exchange SinglePartition
        Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
         HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
      ...
      [237.06666666666666]
      
      scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println)
      ...
      == Query Plan ==
      Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
       Exchange SinglePartition
        HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable
      14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
       Exchange SinglePartition
        HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
      ...
      [142.24,15]
      

      In the first query, AVG is broke into partial aggregation, and gives the right answer (null values ignored). In the second query, since COUNT(DISTINCT key) can't be turned into partial aggregation, AVG isn't either, and the bug is triggered.

        Attachments

          Activity

            People

            • Assignee:
              ueshin Takuya Ueshin
              Reporter:
              ueshin Takuya Ueshin

              Dates

              • Created:
                Updated:
                Resolved:

                Issue deployment