Spark / SPARK-39031

NaN != NaN in pivot


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      I know this is an odd corner case, but NaN != NaN in pivot, which is inconsistent with other places in Spark (groupBy, for example, treats NaN values as equal):

      scala> Seq(Double.NaN, Double.NaN, 1.0, Double.NaN, 1.0, 1.0).toDF.groupBy("value").count.show()
      +-----+-----+                                                                   
      |value|count|
      +-----+-----+
      |  NaN|    3|
      |  1.0|    3|
      +-----+-----+
      
      
      scala> Seq(Double.NaN, Double.NaN, 1.0, Double.NaN, 1.0, 1.0).toDF.groupBy("value").pivot("value").count.show()
      +-----+----+----+
      |value| 1.0| NaN|
      +-----+----+----+
      |  NaN|null|null|
      |  1.0|   3|null|
      +-----+----+----+

      It looks like the issue is that in PivotFirst, when the pivotColumn is an AtomicType, a HashMap is used, but for other types a TreeMap with an interpretedOrdering is used. If DoubleType and FloatType also used the TreeMap, the equality checks would be correct. I am not able to actually test that, though, because pivoting on an array or struct raises an analysis exception:

      scala> Seq(Double.NaN, Double.NaN, 1.0, Double.NaN, 1.0, 1.0).toDF.selectExpr("value", "struct(value) as ar_value").groupBy("value").pivot("ar_value").count.show()
      java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema [1.0]
        at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:182)
        at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:101)
      
      

      scala> Seq(Double.NaN, Double.NaN, 1.0, Double.NaN, 1.0, 1.0).toDF.selectExpr("value", "array(value) as ar_value").groupBy("value").pivot("ar_value").count.show()
      org.apache.spark.sql.AnalysisException: Invalid pivot value '[1.0]': value data type array<double> does not match pivot column data type array<double>
        at org.apache.spark.sql.errors.QueryCompilationErrors$.pivotValDataTypeMismatchError(QueryCompilationErrors.scala:85)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot$$anonfun$apply$10.$anonfun$applyOrElse$21(Analyzer.scala:762)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
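
      As a minimal sketch of the suspected root cause (using plain Scala collections rather than PivotFirst itself, so the map types here are stand-ins): Scala's value equality on Doubles follows IEEE semantics, so NaN == NaN is false and a hash-based lookup on a NaN pivot value always misses, while a TreeMap keyed by a total ordering such as java.lang.Double.compare treats NaN as equal to itself and finds it.

      ```scala
      import scala.collection.immutable.{HashMap, TreeMap}

      object NaNMapDemo {
        // Hash-based lookup: BoxesRunTime value equality gives NaN != NaN,
        // so a NaN key can never be found again after insertion.
        val hashed: HashMap[Double, Int] = HashMap(Double.NaN -> 0, 1.0 -> 1)

        // Tree-based lookup with a total ordering: java.lang.Double.compare
        // returns 0 for (NaN, NaN), so lookup by NaN succeeds.
        val totalOrder: Ordering[Double] =
          Ordering.fromLessThan((a, b) => java.lang.Double.compare(a, b) < 0)
        val ordered: TreeMap[Double, Int] =
          TreeMap(Double.NaN -> 0, 1.0 -> 1)(totalOrder)

        def main(args: Array[String]): Unit = {
          println(hashed.get(Double.NaN))  // None: the NaN key is never matched
          println(ordered.get(Double.NaN)) // Some(0): total ordering matches NaN
        }
      }
      ```

      This mirrors why groupBy (which sorts/normalizes NaN) sees one NaN group while the HashMap path in pivot never matches the NaN pivot value.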
      

People

    Assignee: Unassigned
    Reporter: revans2 (Robert Joseph Evans)