Spark / SPARK-32038

Regression in handling NaN values in COUNT(DISTINCT)


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: SQL

      Description

There seems to be a regression in Spark 3.0.0 in how NaN values are normalized/handled in COUNT(DISTINCT ...). Here is an illustration:

case class Test(uid: String, score: Float)

// Two alternative NaN encodings; the canonical Float.NaN is 0x7fc00000.
val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

val rows = Seq(
  Test("mithunr",  Float.NaN),
  Test("mithunr",  POS_NAN_1),
  Test("mithunr",  POS_NAN_2),
  Test("abellina", 1.0f),
  Test("abellina", 2.0f)
)

// In spark-shell, spark.implicits._ is already in scope for toDF.
rows.toDF.createOrReplaceTempView("mytable")

spark.sql("select uid, count(distinct score) from mytable group by 1 order by 1 asc").show
      

      Here are the results under Spark 3.0.0:

      Spark 3.0.0 (single aggregation)
      +--------+---------------------+
      |     uid|count(DISTINCT score)|
      +--------+---------------------+
      |abellina|                    2|
      | mithunr|                    3|
      +--------+---------------------+
      

Note that the count against mithunr is 3: each NaN bit pattern is counted as a distinct value, as the plain-JVM sketch below illustrates.
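This sketch involves no Spark at all; the values mirror the reproduction above. All three encodings report isNaN, yet floatToRawIntBits tells them apart, so a bitwise notion of distinctness sees three scores where a normalized one sees one:

NaN bit patterns (plain Scala)
val nans = Seq(
  java.lang.Float.intBitsToFloat(0x7fc00000),  // canonical Float.NaN
  java.lang.Float.intBitsToFloat(0x7f800001),
  java.lang.Float.intBitsToFloat(0x7fffffff)
)

// All three are NaN, but each carries different raw bits.
nans.foreach { f =>
  println(f"isNaN=${f.isNaN}  rawBits=0x${java.lang.Float.floatToRawIntBits(f)}%08x")
}

// Distinct raw bit patterns: 3.
println(nans.map(java.lang.Float.floatToRawIntBits(_)).distinct.size)

// After normalizing every NaN to the canonical encoding: 1.
println(nans.map(f => if (f.isNaN) Float.NaN else f)
            .map(java.lang.Float.floatToRawIntBits(_)).distinct.size)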
The right results are returned when another aggregation is added to the GROUP BY query:

      Spark 3.0.0 (multiple aggregations)
      scala> spark.sql(" select uid, count(distinct score), max(score) from mytable group by 1 order by 1 asc ").show
      +--------+---------------------+----------+
      |     uid|count(DISTINCT score)|max(score)|
      +--------+---------------------+----------+
      |abellina|                    2|       2.0|
      | mithunr|                    1|       NaN|
      +--------+---------------------+----------+
      

      Also, note that Spark 2.4.6 normalizes the DISTINCT expression correctly:

      Spark 2.4.6
      scala> spark.sql(" select uid, count(distinct score) from mytable group by 1 order by 1 asc ").show
      
      +--------+---------------------+
      |     uid|count(DISTINCT score)|
      +--------+---------------------+
      |abellina|                    2|
      | mithunr|                    1|
      +--------+---------------------+
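Until a fixed release is available, one possible workaround (a sketch on my part, not a verified fix) is to normalize NaNs explicitly before the distinct count, e.g. with the built-in nanvl function, which returns its first argument unless that argument is NaN, and its second argument otherwise. Every NaN encoding then collapses to the single canonical literal before the count:

Workaround sketch (spark-shell)
// nanvl(a, b) = a unless a is NaN, else b; here every NaN becomes the one
// canonical 'NaN' literal, so COUNT(DISTINCT ...) should see it once.
spark.sql("""
  select uid, count(distinct nanvl(score, cast('NaN' as float)))
  from mytable
  group by 1
  order by 1 asc
""").show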
      

People

• Assignee: viirya (L. C. Hsieh)
• Reporter: mithun (Mithun Radhakrishnan)
