Description
In PR https://github.com/apache/spark/pull/29516, univocity-parsers, which the CSV data source uses, was upgraded from 2.8.3 to 2.9.0 to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This is a breaking change for downstream users who handle a whole row as input.
For example, given the following code:
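The quoting rule described above can be modeled with a minimal sketch. It is not the univocity-parsers implementation: the object `CommentQuotingSketch` and the method `writeRow` are hypothetical names, and the sketch only assumes that the first field is wrapped in double quotes when it starts with the comment character.

```scala
// Hypothetical model of the quoting rule; NOT the univocity-parsers code.
object CommentQuotingSketch {
  // Render one CSV row. The first field is quoted when it starts with the
  // comment character, so a reader does not mistake the row for a comment.
  def writeRow(fields: Seq[String], comment: Char): String = {
    val rendered = fields.zipWithIndex.map { case (field, idx) =>
      if (idx == 0 && field.nonEmpty && field.head == comment) "\"" + field + "\""
      else field
    }
    rendered.mkString(",")
  }
}
```

Under this model, only a leading comment character in the first column triggers quoting; the same character in a later column is written unchanged.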
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test")
Before Spark 3.0, the output CSV file contains:
#abc,1
After this change, it contains:
"#abc",1
Users cannot restore the previous behavior by setting the comment option to '\u0000', because of the newly added `isCommentSet` check:
val isCommentSet = this.comment != '\u0000'

def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}
It would be better to pass the comment option through to univocity whenever users set it explicitly in the CSV data source.
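One way to tell an explicit setting apart from the default is sketched below. This is an illustration under assumptions, not the actual patch: the options are modeled as a plain `Map[String, String]`, and `CommentOptionSketch`, `isCommentSetBySentinel`, and `isCommentSetExplicitly` are hypothetical names.

```scala
// Hypothetical sketch; the real CSVOptions class is not reproduced here.
object CommentOptionSketch {
  // Mirrors the check quoted above: the NUL character doubles as "not set",
  // so an explicit NUL comment option cannot be told apart from the default.
  def isCommentSetBySentinel(parameters: Map[String, String]): Boolean = {
    val comment = parameters.get("comment").flatMap(_.headOption).getOrElse('\u0000')
    comment != '\u0000'
  }

  // Assumed alternative: the option counts as set whenever the user
  // supplied a single-character value, whatever that character is.
  def isCommentSetExplicitly(parameters: Map[String, String]): Boolean =
    parameters.get("comment").exists(_.length == 1)
}
```

With the sentinel check, `Map("comment" -> "\u0000")` is treated the same as an empty map; with the explicit check, the two cases differ, which is what allows the value to be passed through.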
After this change, the behavior is as follows:
id | code | 2.4 and before | 3.0 and after | this update | remark |
1 | Seq("#abc", "\u0000def", "xyz").toDF().write.option("comment", "\u0000").csv(path) | #abc def xyz | "#abc" def xyz | #abc "def" xyz | this update differs slightly from 3.0 |
2 | Seq("#abc", "\u0000def", "xyz").toDF().write.option("comment", "#").csv(path) | #abc def xyz | "#abc" def xyz | "#abc" def xyz | the same |
3 | Seq("#abc", "\u0000def", "xyz").toDF().write.csv(path) | #abc def xyz | "#abc" def xyz | "#abc" def xyz | default behavior: the same |
4 | Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.option("comment", "\u0000").csv(path) | #abc xyz | #abc \u0000def xyz | #abc xyz | this update differs slightly from 3.0 |
5 | Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.option("comment", "#").csv(path) | \u0000def xyz | \u0000def xyz | \u0000def xyz | the same |
6 | Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.csv(path) | #abc xyz | #abc \u0000def xyz | #abc \u0000def xyz | default behavior: the same |
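The read-side results in the "this update" column (rows 4 to 6) can be reproduced with a small model. It is only a sketch under assumptions: `CommentReadSketch` and `readLines` are hypothetical names, an unset option is modeled as `None`, and a comment line is assumed to be any line whose first character is the comment character.

```scala
// Hypothetical model of rows 4 to 6; NOT Spark's or univocity's reader.
object CommentReadSketch {
  // Some(c): lines starting with c are dropped as comments.
  // None: the option was not set, so no line is treated as a comment.
  def readLines(lines: Seq[String], comment: Option[Char]): Seq[String] =
    comment match {
      case Some(c) => lines.filterNot(_.startsWith(c.toString))
      case None    => lines
    }
}
```

For the input lines `#abc`, `\u0000def`, and `xyz`, this model keeps `#abc` and `xyz` when the comment is the NUL character, keeps `\u0000def` and `xyz` when it is `#`, and keeps all three lines when the option is unset.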