Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-1915

AverageFunction should not count if the evaluated value is null.

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.0.1, 1.1.0
    • SQL
    • None

    Description

      Average values are difference between the calculation is done partially or not partially.

      Because AverageFunction (in not-partially calculation) counts even if the evaluated value is null.

      To reproduce this bug, run the following in sbt/sbt hive/console:

      scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
      ...
      == Query Plan ==
      Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
       Exchange SinglePartition
        Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
         HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable
      14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
       Exchange SinglePartition
        Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
         HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
      ...
      [237.06666666666666]
      
      scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println)
      ...
      == Query Plan ==
      Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
       Exchange SinglePartition
        HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable
      14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
       Exchange SinglePartition
        HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
      ...
      [142.24,15]
      

      In the first query, AVG is broke into partial aggregation, and gives the right answer (null values ignored). In the second query, since COUNT(DISTINCT key) can't be turned into partial aggregation, AVG isn't either, and the bug is triggered.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            ueshin Takuya Ueshin
            ueshin Takuya Ueshin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment