[SPARK-42335] Pass the comment option through to univocity if users set it explicitly in CSV dataSource - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.0, 3.1.0, 3.2.0, 3.3.0
Fix Version/s: 3.5.0
Component/s: SQL
Labels: None
Target Version/s: 3.4.0

Description

In PR https://github.com/apache/spark/pull/29516, univocity-parsers, which the CSV dataSource uses, was upgraded from 2.8.3 to 2.9.0 to fix some bugs. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column when they start with the comment character. This was a breaking change for downstream users who handle a whole row as input.

For example, given this code:

Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test")

Before Spark 3.0, the content of the output CSV file is shown as:

After the upgrade, the content is shown as:

Users can't set the comment option to '\u0000' to keep the previous behavior, because of the newly added `isCommentSet` check:

val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
    format.setComment(comment)
  }
  // other code
}

It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource.
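One way to picture the proposal is to derive `isCommentSet` from whether the option key is present, rather than from the value of the character. The following is only a minimal sketch under stated assumptions: `CommentOption` is a hypothetical, simplified stand-in for the CSV options class, it takes a plain `Map` of user options, and it silently falls back to the default for values that are not a single character. It is not the code from the linked pull request.

```scala
// Hypothetical, simplified stand-in for the CSV options class (not Spark code).
final case class CommentOption(parameters: Map[String, String]) {
  // The raw value, present only if the user supplied the "comment" option.
  private val raw: Option[String] = parameters.get("comment")

  // Effective comment character; '\u0000' is the default when nothing usable is set.
  val comment: Char = raw.filter(_.length == 1).map(_.head).getOrElse('\u0000')

  // True when the user explicitly supplied a single-character value,
  // even if that value is '\u0000'.
  val isCommentSet: Boolean = raw.exists(_.length == 1)
}

object CommentOptionDemo {
  def main(args: Array[String]): Unit = {
    val notSet = CommentOption(Map.empty)
    val explicitNul = CommentOption(Map("comment" -> "\u0000"))
    val explicitHash = CommentOption(Map("comment" -> "#"))

    // Only the explicitly set options would be passed through to univocity.
    println(s"not set:      isCommentSet=${notSet.isCommentSet}")
    println(s"explicit NUL: isCommentSet=${explicitNul.isCommentSet}")
    println(s"explicit #:   isCommentSet=${explicitHash.isCommentSet}")
  }
}
```

With a check of this shape, `asWriterSettings` can keep its `if (isCommentSet)` guard unchanged, and an explicit '\u0000' is passed through to univocity instead of being treated as "not set".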

After this change, the behavior is as follows:

id	code	2.4 and before	3.0 and after	this update	remark
1	Seq("#abc", "\u0000def", "xyz").toDF().write.option("comment", "\u0000").csv(path)	#abc def xyz	"#abc" def xyz	#abc "def" xyz	this update differs slightly from 3.0
2	Seq("#abc", "\u0000def", "xyz").toDF().write.option("comment", "#").csv(path)	#abc def xyz	"#abc" def xyz	"#abc" def xyz	the same
3	Seq("#abc", "\u0000def", "xyz").toDF().write.csv(path)	#abc def xyz	"#abc" def xyz	"#abc" def xyz	default behavior: the same
4	Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.option("comment", "\u0000").csv(path)	#abc xyz	#abc \u0000def xyz	#abc xyz	this update differs slightly from 3.0
5	Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.option("comment", "#").csv(path)	\u0000def xyz	\u0000def xyz	\u0000def xyz	the same
6	Seq("#abc", "\u0000def", "xyz").toDF().write.text(path); spark.read.csv(path)	#abc xyz	#abc \u0000def xyz	#abc \u0000def xyz	default behavior: the same

Attachments

image-2023-02-03-18-56-01-596.png
03/Feb/23 10:56
5 kB
Wei Guo
image-2023-02-03-18-56-10-083.png
03/Feb/23 10:56
4 kB
Wei Guo

Issue Links

links to

[Github] Pull Request #39878 (wayneguow)

People

Assignee: Wei Guo

Reporter: Wei Guo

Votes: 0

Watchers: 2

Dates

Created: 03/Feb/23 10:55

Updated: 08/Feb/23 21:12

Resolved: 08/Feb/23 21:12