[SPARK-1915] AverageFunction should not count if the evaluated value is null. - ASF JIRA

Attach files

Attach Screenshot

Voters

Watch issue

Watchers

Create sub-task

Link

Clone

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.1, 1.1.0
Component/s: SQL
Labels:
None

Description

Average values are difference between the calculation is done partially or not partially.

Because AverageFunction (in not-partially calculation) counts even if the evaluated value is null.

To reproduce this bug, run the following in sbt/sbt hive/console:

scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable
14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
...
[237.06666666666666]

scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable
14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
...
[142.24,15]

In the first query, AVG is broke into partial aggregation, and gives the right answer (null values ignored). In the second query, since COUNT(DISTINCT key) can't be turned into partial aggregation, AVG isn't either, and the bug is triggered.

Attachments

Activity

Comment

This comment will be Viewable by All Users Viewable by All Users

Cancel

People

Assignee:: Takuya Ueshin

Reporter:: Takuya Ueshin

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 23/May/14 15:23

Updated:: 28/May/14 03:59

Resolved:: 28/May/14 01:58

Agile

View on Board

AverageFunction should not count if the evaluated value is null.

Details

Description

Attachments

Attachments

Activity

People

Dates

Agile

Slack

Issue deployment