Spark / SPARK-25251

Make spark-csv's `quote` and `escape` options conform to RFC 4180


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0, 2.3.1, 2.4.0, 3.0.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      As described in RFC 4180, page 2:

         7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
      

      That's what Excel, for example, does by default.
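
      For illustration, here is a quick sketch using Python's standard csv module, which follows the same doubling convention (the field values are made up):

      import csv, io

      # Write one record whose first field contains both a comma and a
      # double quote; the writer doubles the embedded quote per RFC 4180.
      buf = io.StringIO()
      csv.writer(buf).writerow(['He said "hi", then left', 2])
      print(buf.getvalue())
      # -> "He said ""hi"", then left",2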

      In Spark (as of Spark 2.1), however, escaping is done by default in a non-RFC way, using the backslash (\). To fix this you have to explicitly tell Spark to use the double quote as the escape character:

      .option('quote', '"') 
      .option('escape', '"')
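
      A minimal end-to-end sketch (assuming a running SparkSession named spark; the file path and sample data are invented for illustration):

      data = 'name,age\n"He said ""hi"", then left",2\n'
      with open('/tmp/rfc4180.csv', 'w') as f:
          f.write(data)

      df = (spark.read
            .option('header', 'true')
            .option('quote', '"')
            .option('escape', '"')   # treat "" as an escaped quote, per RFC 4180
            .csv('/tmp/rfc4180.csv'))
      df.show(truncate=False)
      # Expected: one row with name = He said "hi", then left  and age = 2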
      

      This may explain cases where a comma was not interpreted as being inside a quoted column.

      So this is a request to make the spark-csv reader RFC 4180 compliant with regard to the default values of the `quote` and `escape` options (make both equal to ").

      Since this is a backward-incompatible change, Spark 3.0 might be a good release for it.

      Some more background: https://stackoverflow.com/a/45138591/470583

People

    Assignee: Unassigned
    Reporter: Ruslan Dautkhanov (Tagar)
    Votes: 0
    Watchers: 1
