Description
Average values are difference between the calculation is done partially or not partially.
Because AverageFunction (in not-partially calculation) counts even if the evaluated value is null.
To reproduce this bug, run the following in sbt/sbt hive/console:
scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println) ... == Query Plan == Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644] Exchange SinglePartition Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648] HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable 14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98 == Query Plan == Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644] Exchange SinglePartition Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648] HiveTableScan [key#646], (MetastoreRelation default, src1, None), None) ... [237.06666666666666] scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println) ... == Query Plan == Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669] Exchange SinglePartition HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable 14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98 == Query Plan == Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669] Exchange SinglePartition HiveTableScan [key#672], (MetastoreRelation default, src1, None), None) ... [142.24,15]
In the first query, AVG is broke into partial aggregation, and gives the right answer (null values ignored). In the second query, since COUNT(DISTINCT key) can't be turned into partial aggregation, AVG isn't either, and the bug is triggered.