SPARK-25739: Double quote coming in as empty value even when emptyValue set as null


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:
      None
    • Environment:

       Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 2.11) 

    • Flags:
      Important

      Description

       Example code - 

      val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
      df
      .repartition(1)
      .write
      .mode("overwrite")
      .option("nullValue", null)
      .option("emptyValue", null)
      .option("delimiter",",")
      .option("quoteMode", "NONE")
      .option("escape","\\")
      .format("csv")
      .save("/tmp/nullcsv/")
      
      var out = dbutils.fs.ls("/tmp/nullcsv/")
      var file = out(out.size - 1)
      val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
      println(x)
      

      Output - 

      1,""
      3,hi
      2,hello
      4,
      

      Expected output - 

      1,
      3,hi
      2,hello
      4,
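
Outside Databricks, where `dbutils` is unavailable, the part file written by the snippet above can be inspected with plain JVM file APIs. This is a hedged sketch: `InspectCsv` and `firstPart` are illustrative names, not part of the report, and `/tmp/nullcsv/` is the path used above.

```scala
import java.nio.file.{Files, Path, Paths}

object InspectCsv {
  // Find the first Spark output part file in a directory
  // (repartition(1) above means there is exactly one part-*.csv).
  def firstPart(dir: Path): java.util.Optional[Path] =
    Files.list(dir)
      .filter(p => p.getFileName.toString.startsWith("part-"))
      .findFirst()

  def main(args: Array[String]): Unit = {
    val part = firstPart(Paths.get("/tmp/nullcsv"))
    // Print the written CSV lines, mirroring the dbutils.fs.head call above.
    if (part.isPresent) Files.readAllLines(part.get).forEach(line => println(line))
  }
}
```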
      

       

This commit is relevant to my issue: https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe

      "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and do not reflect to any characters in saved CSV files."

I am on Spark version 2.3.1, so empty strings should be written as null. I am also passing the "emptyValue" option explicitly. Even so, my empty values are still coming out as `""` in the written file.
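
To make the quoted-vs-unquoted distinction concrete, here is a toy field renderer illustrating the behavior change the quoted migration note describes. This is NOT Spark's actual writer code; `renderField` and its `emptyValue` parameter are hypothetical, with `None` modeling an unset option.

```scala
// Toy sketch of the 2.4 CSV change: with no emptyValue configured, an empty
// string field is written as a quoted "", while nulls are written as nothing.
object EmptyValueSketch {
  def renderField(value: String, emptyValue: Option[String]): String =
    value match {
      case null => ""                            // null -> no characters
      case ""   => emptyValue.getOrElse("\"\"")  // 2.4 default: quoted ""
      case s    => s                             // regular values pass through
    }

  def main(args: Array[String]): Unit = {
    println(renderField("", None))       // 2.4 default: prints ""
    println(renderField("", Some("")))   // pre-2.4 behavior: prints an empty line
    println(renderField("hello", None))  // prints hello
  }
}
```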

       

I have tested the provided code on Databricks runtime 5.0 and 4.1, and it gives the expected output. However, on Databricks runtime 4.2 and 4.3 (which run Spark 2.3.1), we get the incorrect output.
