Description
Problem statement
Prior to Spark version 2.4.0, empty string values were written to CSV as unquoted empty fields.
Since version 2.4.0, empty string values end up as `""` in CSV files, which is a problem if an application expects empty values not to be wrapped in quotes (which is certainly the case if the CSV is intended to be consumed by Microsoft Power BI, for example, as it doesn't handle CSV files with double quotes).
The following code produces different results in different versions of Spark:
| Spark version | Code | Result |
|---|---|---|
| 2.3.0 | `val df = List("aa", "", "bb").toDF("name")`<br>`df.coalesce(1).write.option("header", "true").csv("/23.csv")` | `name`<br>`aa`<br>(empty line)<br>`bb` |
| 2.4.0 | `val df = List("aa", "", "bb").toDF("name")`<br>`df.coalesce(1).write.option("header", "true").csv("/24.csv")` | `name`<br>`aa`<br>`""`<br>`bb` |
| 2.4.0 | `val df = List("aa", "", "bb").toDF("name")`<br>`df.coalesce(1).write.option("header", "true").option("quote", "").csv("/24-2.csv")` | `name`<br>`aa`<br>`""`<br>`bb` |
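Until quoting can be disabled, a possible workaround (assuming Spark 2.4.0+, which added an `emptyValue` option to the CSV writer) is to set `emptyValue` to an empty string so that empty values are written as bare empty fields rather than `""`. This is a sketch, not a confirmed fix for every case; the output path is illustrative:

```scala
// Sketch of a possible workaround: control the literal written for
// empty strings via the CSV writer's emptyValue option (Spark 2.4.0+).
// The output path is illustrative.
val df = List("aa", "", "bb").toDF("name")

df.coalesce(1)
  .write
  .option("header", "true")
  .option("emptyValue", "")   // write empty strings as empty fields, not ""
  .csv("/24-emptyValue.csv")
```

Note that this only addresses empty strings; it does not disable quoting in general.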
If the intention was to produce standard-looking CSV files (even though no formal CSV standard exists), we still need a way to disable automatic quoting.
Also, using `option("quote", "\u0000")` had no effect; double quotes were still used.
Proposed solution
Setting the option `option("quote", "")` should disable quoting entirely.
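Under the proposal, the third example above would behave as follows. This sketches the *proposed* behavior, not what Spark currently does:

```scala
// Proposed behavior (not current Spark behavior): an empty quote
// string disables quoting, so the empty value is written bare.
val df = List("aa", "", "bb").toDF("name")

df.coalesce(1)
  .write
  .option("header", "true")
  .option("quote", "")        // proposed: disable quoting entirely
  .csv("/24-2.csv")

// Proposed file contents (matching the 2.3.0 output):
// name
// aa
//
// bb
```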