Spark / SPARK-23835

When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

      I constructed a DataFrame with a nullable java.lang.Double column (plus a second, non-nullable Double column), then converted it to a Dataset with as[(Double, Double)]. When the Dataset is shown, the null appears as expected. When it is collected and printed, however, the null is silently converted to -1.0.

      Code snippet to reproduce this:

      val localSpark = spark
      import localSpark.implicits._
      val df = Seq[(java.lang.Double, Double)](
        (1.0, 2.0),
        (3.0, 4.0),
        (Double.NaN, 5.0),
        (null, 6.0)
      ).toDF("a", "b")
      df.show()  // OUTPUT 1: has null
      
      df.printSchema()
      val data = df.as[(Double, Double)]
      data.show()  // OUTPUT 2: has null
      data.collect().foreach(println)  // OUTPUT 3: has -1
      

      OUTPUT 1 and 2:

      +----+---+
      |   a|  b|
      +----+---+
      | 1.0|2.0|
      | 3.0|4.0|
      | NaN|5.0|
      |null|6.0|
      +----+---+
      

      OUTPUT 3:

      (1.0,2.0)
      (3.0,4.0)
      (NaN,5.0)
      (-1.0,6.0)
      
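
      A sketch of a workaround until the fix lands: encode the nullable column as Option[Double] instead of the primitive Double, so nulls survive collect() as None rather than being coerced to a sentinel value. (This assumes the same SparkSession setup as the snippet above.)

      val localSpark = spark
      import localSpark.implicits._
      val df = Seq[(java.lang.Double, Double)](
        (1.0, 2.0),
        (null, 6.0)
      ).toDF("a", "b")
      // Option[Double] preserves SQL NULL as None instead of -1.0
      val data = df.as[(Option[Double], Double)]
      data.collect().foreach(println)
      // (Some(1.0),2.0)
      // (None,6.0)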


            People

            • Assignee:
              mgaido Marco Gaido
            • Reporter:
              josephkb Joseph K. Bradley
            • Votes:
              0
            • Watchers:
              6
