Spark / SPARK-46959

CSV reader reads data inconsistently depending on column position


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Reading the following CSV

      "a";"b";"c";"d"
      10;100,00;"Some;String";"ok"
      20;200,00;"";"still ok"
      30;300,00;"also ok";""
      40;400,00;"";"" 

      with these options

      spark.read
              .option("header", "true")
              .option("sep", ";")
              .option("encoding", "ISO-8859-1")
              .option("lineSep", "\r\n")
              .option("nullValue", "")
              .option("quote", "\"")
              .option("escape", "")

      results in the following inconsistent DataFrame:

      a    b       c            d
      10   100,00  Some;String  ok
      20   200,00  <null>       still ok
30 300,00 also ok "
40 400,00 <null> "

      As one can see, the quoted empty fields ("") in the last column are not read as null; instead they contain a single double-quote character. The same quoted empty fields in column c are handled correctly.
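      For reference only (this is not Spark's parser): Python's stdlib csv module applies RFC 4180-style quoting, under which a quoted empty field parses to an empty string in every column, independent of its position. A minimal sketch over the data above, mirroring the separator and quote character from the report:

      ```python
      import csv
      import io

      # The CSV payload from the report, with the "\r\n" line endings
      # that the lineSep option in the report specifies.
      data = (
          '"a";"b";"c";"d"\r\n'
          '10;100,00;"Some;String";"ok"\r\n'
          '20;200,00;"";"still ok"\r\n'
          '30;300,00;"also ok";""\r\n'
          '40;400,00;"";""\r\n'
      )

      # Parse with the same separator and quote character as the Spark
      # options in the report; csv.reader recognizes \r and \n as line ends.
      rows = list(csv.reader(io.StringIO(data), delimiter=";", quotechar='"'))
      header, records = rows[0], rows[1:]

      assert header == ["a", "b", "c", "d"]
      # Columns c and d behave identically here: "" parses to an empty
      # string in both, which Spark's nullValue="" would then map to null.
      assert records[1][2] == "" and records[1][3] == "still ok"
      assert records[3][2] == "" and records[3][3] == ""
      ```

      The expectation is that Spark's reader would treat both columns the same way; the report shows that with escape="" it does not.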

      If I recall correctly, this only happens when the "escape" option is set to an empty string. Leaving the option unset (it defaults to "\") does not appear to trigger the bug.

      I observed this on Databricks Runtime 13.2 (which, I believe, corresponds to Spark 3.4.1).


          People

            Assignee: Unassigned
            Reporter: Martin Rueckl (martinitus)
            Votes: 0
            Watchers: 1
