Details
- Type: Bug
- Priority: Blocker
- Status: Resolved
- Resolution: Fixed
- Affects Version: 3.0.0
Description
There appears to be a regression in Spark 3.0.0 in how NaN values are normalized (and therefore counted) in COUNT(DISTINCT ...). Here is an illustration:

case class Test(uid: String, score: Float)

val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

val rows = Seq(
  Test("mithunr", Float.NaN),
  Test("mithunr", POS_NAN_1),
  Test("mithunr", POS_NAN_2),
  Test("abellina", 1.0f),
  Test("abellina", 2.0f)
).toDF.createOrReplaceTempView("mytable")

spark.sql(" select uid, count(distinct score) from mytable group by 1 order by 1 asc ").show
Here are the results under Spark 3.0.0:
Spark 3.0.0 (single aggregation)
+--------+---------------------+
|     uid|count(DISTINCT score)|
+--------+---------------------+
|abellina|                    2|
| mithunr|                    3|
+--------+---------------------+
Note that the count for mithunr is 3: each distinct NaN bit pattern is counted as a separate value.
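For context on why the single-aggregate plan can report 3, here is a hypothetical standalone illustration (plain Java rather than Spark, since the relevant calls are on java.lang.Float) of the underlying IEEE-754 fact: many distinct bit patterns all encode NaN, so a bitwise notion of "distinct" can see several values where SQL semantics expect one.

```java
public class NanBitPatterns {
    public static void main(String[] args) {
        // The same payloads used in the repro above.
        float canonical = Float.NaN;                       // bits 0x7fc00000
        float posNan1 = Float.intBitsToFloat(0x7f800001);
        float posNan2 = Float.intBitsToFloat(0x7fffffff);

        // All three encode NaN under IEEE-754 semantics.
        System.out.println(Float.isNaN(canonical)); // true
        System.out.println(Float.isNaN(posNan1));   // true
        System.out.println(Float.isNaN(posNan2));   // true

        // Float.floatToIntBits canonicalizes: every NaN maps to 0x7fc00000.
        // A DISTINCT that compares canonicalized bits counts one NaN; one
        // that compares the raw, un-normalized bits can count three.
        System.out.println(Integer.toHexString(Float.floatToIntBits(posNan1))); // 7fc00000
        System.out.println(Integer.toHexString(Float.floatToIntBits(posNan2))); // 7fc00000
    }
}
```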
The correct results are returned when another aggregation is added to the GROUP BY query:
Spark 3.0.0 (multiple aggregations)
scala> spark.sql(" select uid, count(distinct score), max(score) from mytable group by 1 order by 1 asc ").show
+--------+---------------------+----------+
|     uid|count(DISTINCT score)|max(score)|
+--------+---------------------+----------+
|abellina|                    2|       2.0|
| mithunr|                    1|       NaN|
+--------+---------------------+----------+
Also, note that Spark 2.4.6 normalizes the DISTINCT expression correctly:
Spark 2.4.6
scala> spark.sql(" select uid, count(distinct score) from mytable group by 1 order by 1 asc ").show
+--------+---------------------+
|     uid|count(DISTINCT score)|
+--------+---------------------+
|abellina|                    2|
| mithunr|                    1|
+--------+---------------------+
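The 2.4.6 result matches what a normalization step applied before the distinct aggregation produces. A minimal sketch of that idea (plain Java, not Spark's actual normalization rule) is:

```java
import java.util.HashSet;
import java.util.Set;

public class DistinctWithNormalization {
    public static void main(String[] args) {
        // mithunr's scores from the repro: three different NaN encodings.
        float[] scores = {
            Float.NaN,
            Float.intBitsToFloat(0x7f800001),
            Float.intBitsToFloat(0x7fffffff)
        };

        // Deduplicate on canonical bits: Float.floatToIntBits maps every
        // NaN to 0x7fc00000, so all three encodings land in one bucket.
        Set<Integer> distinct = new HashSet<>();
        for (float s : scores) {
            distinct.add(Float.floatToIntBits(s));
        }
        System.out.println(distinct.size()); // 1, matching Spark 2.4.6

        // Deduplicate on raw bits instead: distinct NaN encodings are kept
        // apart, which is the shape of the buggy count above.
        Set<Integer> rawDistinct = new HashSet<>();
        for (float s : scores) {
            rawDistinct.add(Float.floatToRawIntBits(s));
        }
        System.out.println(rawDistinct.size());
    }
}
```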