SPARK-23835

When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      I constructed a DataFrame with a nullable java.lang.Double column (plus a second, primitive Double column), then converted it to a Dataset with `as[(Double, Double)]`. When the Dataset is shown, the null is still displayed; when it is collected and printed, the null has been silently converted to -1.0.

      Code snippet to reproduce this:

      val localSpark = spark
      import localSpark.implicits._
      val df = Seq[(java.lang.Double, Double)](
        (1.0, 2.0),
        (3.0, 4.0),
        (Double.NaN, 5.0),
        (null, 6.0)
      ).toDF("a", "b")
      df.show()  // OUTPUT 1: has null
      
      df.printSchema()
      val data = df.as[(Double, Double)]
      data.show()  // OUTPUT 2: has null
      data.collect().foreach(println)  // OUTPUT 3: has -1
      

      OUTPUT 1 and 2:

      +----+---+
      |   a|  b|
      +----+---+
      | 1.0|2.0|
      | 3.0|4.0|
      | NaN|5.0|
      |null|6.0|
      +----+---+
      

      OUTPUT 3:

      (1.0,2.0)
      (3.0,4.0)
      (NaN,5.0)
      (-1.0,6.0)
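      As a workaround, declaring the nullable column as Option[Double] in the target type keeps the null intact: Spark's encoders map a SQL null to None rather than forcing it into a primitive slot. A sketch reusing the df built in the snippet above:

      // Same df as above; Option[Double] round-trips the null safely.
      val safeData = df.as[(Option[Double], Double)]
      safeData.show()                      // still shows null
      safeData.collect().foreach(println)  // the null row decodes to None, not -1.0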
      


          People

            Assignee: Marco Gaido (mgaido)
            Reporter: Joseph K. Bradley (josephkb)
            Votes: 0
            Watchers: 6
