Spark / SPARK-32038

Regression in handling NaN values in COUNT(DISTINCT)


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: SQL

      Description

There seems to be a regression in Spark 3.0.0 in how NaN values are normalized/handled in COUNT(DISTINCT ...). Here is an illustration:

case class Test(uid: String, score: Float)

// Two alternative NaN encodings; the canonical Float.NaN is 0x7fc00000.
val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

val rows = Seq(
  Test("mithunr",  Float.NaN),
  Test("mithunr",  POS_NAN_1),
  Test("mithunr",  POS_NAN_2),
  Test("abellina", 1.0f),
  Test("abellina", 2.0f)
)

// In spark-shell, spark.implicits._ is already in scope for toDF.
rows.toDF.createOrReplaceTempView("mytable")

spark.sql("select uid, count(distinct score) from mytable group by 1 order by 1 asc").show
      

      Here are the results under Spark 3.0.0:

      Spark 3.0.0 (single aggregation)
      +--------+---------------------+
      |     uid|count(DISTINCT score)|
      +--------+---------------------+
      |abellina|                    2|
      | mithunr|                    3|
      +--------+---------------------+
      

Note that the count against mithunr is 3: each NaN bit pattern is counted as a distinct value, as the plain-JVM sketch below illustrates.
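This sketch involves no Spark at all; the values mirror the reproduction above. All three encodings report isNaN, yet floatToRawIntBits tells them apart, so a bitwise notion of distinctness sees three scores where a normalized one sees one:

NaN bit patterns (plain Scala)
val nans = Seq(
  java.lang.Float.intBitsToFloat(0x7fc00000),  // canonical Float.NaN
  java.lang.Float.intBitsToFloat(0x7f800001),
  java.lang.Float.intBitsToFloat(0x7fffffff)
)

// All three are NaN, but each carries different raw bits.
nans.foreach { f =>
  println(f"isNaN=${f.isNaN}  rawBits=0x${java.lang.Float.floatToRawIntBits(f)}%08x")
}

// Distinct raw bit patterns: 3.
println(nans.map(java.lang.Float.floatToRawIntBits(_)).distinct.size)

// After normalizing every NaN to the canonical encoding: 1.
println(nans.map(f => if (f.isNaN) Float.NaN else f)
            .map(java.lang.Float.floatToRawIntBits(_)).distinct.size)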
The right results are returned when another aggregation is added to the GROUP BY query:

      Spark 3.0.0 (multiple aggregations)
      scala> spark.sql(" select uid, count(distinct score), max(score) from mytable group by 1 order by 1 asc ").show
      +--------+---------------------+----------+
      |     uid|count(DISTINCT score)|max(score)|
      +--------+---------------------+----------+
      |abellina|                    2|       2.0|
      | mithunr|                    1|       NaN|
      +--------+---------------------+----------+
      

      Also, note that Spark 2.4.6 normalizes the DISTINCT expression correctly:

      Spark 2.4.6
      scala> spark.sql(" select uid, count(distinct score) from mytable group by 1 order by 1 asc ").show
      
      +--------+---------------------+
      |     uid|count(DISTINCT score)|
      +--------+---------------------+
      |abellina|                    2|
      | mithunr|                    1|
      +--------+---------------------+
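Until a fixed release is available, one possible workaround (a sketch on my part, not a verified fix) is to normalize NaNs explicitly before the distinct count, e.g. with the built-in nanvl function, which returns its first argument unless that argument is NaN, and its second argument otherwise. Every NaN encoding then collapses to the single canonical literal before the count:

Workaround sketch (spark-shell)
// nanvl(a, b) = a unless a is NaN, else b; here every NaN becomes the one
// canonical 'NaN' literal, so COUNT(DISTINCT ...) should see it once.
spark.sql("""
  select uid, count(distinct nanvl(score, cast('NaN' as float)))
  from mytable
  group by 1
  order by 1 asc
""").show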
      

People

• Assignee: viirya (L. C. Hsieh)
• Reporter: mithun (Mithun Radhakrishnan)
