Spark / SPARK-32038

Regression in handling NaN values in COUNT(DISTINCT)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: SQL

    Description

      There seems to be a regression in Spark 3.0.0 in how NaN values are normalized/handled in COUNT(DISTINCT ...). Here is an illustration:

      // Runs in spark-shell, where spark.implicits._ (needed for toDF) is in scope.
      case class Test(uid: String, score: Float)

      // Two non-canonical NaN bit patterns; the canonical Float.NaN is 0x7fc00000.
      val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
      val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

      Seq(
        Test("mithunr",  Float.NaN),
        Test("mithunr",  POS_NAN_1),
        Test("mithunr",  POS_NAN_2),
        Test("abellina", 1.0f),
        Test("abellina", 2.0f)
      ).toDF.createOrReplaceTempView("mytable")

      spark.sql(" select uid, count(distinct score) from mytable group by 1 order by 1 asc ").show
      
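      Note that all three of the mithunr scores above are NaN, but each carries a different raw bit pattern. A quick standalone check (plain Scala, no Spark required) makes this visible; a COUNT(DISTINCT) that compares raw bits rather than normalized values would therefore see three distinct scores:

      // All three values are NaN, but each has a different raw bit pattern.
      // (floatToIntBits, by contrast, collapses every NaN to 0x7fc00000.)
      val nanValues = Seq(
        Float.NaN,
        java.lang.Float.intBitsToFloat(0x7f800001),
        java.lang.Float.intBitsToFloat(0x7fffffff)
      )
      nanValues.foreach { f =>
        println(f"isNaN=${f.isNaN} rawBits=0x${java.lang.Float.floatToRawIntBits(f)}%08x")
      }
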

      Here are the results under Spark 3.0.0:

      Spark 3.0.0 (single aggregation)
      +--------+---------------------+
      |     uid|count(DISTINCT score)|
      +--------+---------------------+
      |abellina|                    2|
      | mithunr|                    3|
      +--------+---------------------+
      

      Note that the count for mithunr is 3: each distinct NaN bit pattern is counted as a separate value, instead of all NaNs being normalized to a single value.
      The right results are returned when another aggregation is added alongside the DISTINCT (see the plan-comparison sketch after the output below):

      Spark 3.0.0 (multiple aggregations)
      scala> spark.sql(" select uid, count(distinct score), max(score) from mytable group by 1 order by 1 asc ").show
      +--------+---------------------+----------+
      |     uid|count(DISTINCT score)|max(score)|
      +--------+---------------------+----------+
      |abellina|                    2|       2.0|
      | mithunr|                    1|       NaN|
      +--------+---------------------+----------+
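
      The divergence presumably shows up in the query plans, i.e. whether the floating-point normalization step is applied to the distinct aggregate in each shape. Here is a sketch for comparing the two; explain(true) prints the parsed, analyzed, optimized, and physical plans:

      // Sketch: dump full plans for both query shapes to spot where they diverge.
      val single = spark.sql(
        "select uid, count(distinct score) from mytable group by 1 order by 1 asc")
      val multi = spark.sql(
        "select uid, count(distinct score), max(score) from mytable group by 1 order by 1 asc")
      single.explain(true)  // plans for the mis-counting query
      multi.explain(true)   // plans for the correctly-counting query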
      

      Also, note that Spark 2.4.6 normalizes the DISTINCT expression correctly:

      Spark 2.4.6
      scala> spark.sql(" select uid, count(distinct score) from mytable group by 1 order by 1 asc ").show
      
      +--------+---------------------+
      |     uid|count(DISTINCT score)|
      +--------+---------------------+
      |abellina|                    2|
      | mithunr|                    1|
      +--------+---------------------+
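
      Until a fixed build is available, one possible workaround (an untested sketch; nanvl and cast are standard Spark SQL functions) is to canonicalize NaNs ahead of the DISTINCT, so that every NaN carries the same bit pattern:

      // Untested workaround sketch: nanvl(score, x) returns x when score is NaN,
      // so every NaN row is replaced by the single canonical NaN literal before
      // the DISTINCT comparison; non-NaN scores pass through unchanged.
      spark.sql("""
        select uid,
               count(distinct nanvl(score, cast('NaN' as float))) as distinct_scores
        from mytable
        group by 1
        order by 1 asc
      """).show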
      


          People

            Assignee: L. C. Hsieh (viirya)
            Reporter: Mithun Radhakrishnan (mithun)
            Votes: 1
            Watchers: 4
