[SPARK-22236] CSV I/O: does not respect RFC 4180


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      When reading or writing CSV files with Spark, double quotes are escaped with a backslash by default. However, the appropriate behaviour, as set out by RFC 4180 (and adhered to by many software packages), is to escape a double quote by doubling it, i.e. preceding it with another double quote.

      This piece of Python code demonstrates the issue:

      import csv
      with open('testfile.csv', 'w') as f:
          cw = csv.writer(f)
          cw.writerow(['a 2.5" drive', 'another column'])
          cw.writerow(['a "quoted" string', '"quoted"'])
          cw.writerow([1,2])
      
      with open('testfile.csv') as f:
          print(f.read())
      
      # "a 2.5"" drive",another column
      # "a ""quoted"" string","""quoted"""
      # 1,2
      
      spark.read.csv('testfile.csv').collect()
      
      # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
      #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
      #  Row(_c0='1', _c1='2')]
      
      # explicitly stating the escape character fixed the issue
      spark.read.option('escape', '"').csv('testfile.csv').collect()
      
      # [Row(_c0='a 2.5" drive', _c1='another column'),
      #  Row(_c0='a "quoted" string', _c1='"quoted"'),
      #  Row(_c0='1', _c1='2')]
      
      

      The same applies to writes: reading the file written by Spark back with a standards-compliant parser (here Python's csv module) may result in garbage.

      df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file correctly
      df.write.format("csv").save('testout.csv')
      with open('testout.csv/part-....csv') as f:
          cr = csv.reader(f)
          print(next(cr))
          print(next(cr))
      
      # ['a 2.5\\ drive"', 'another column']
      # ['a \\quoted\\" string"', '\\quoted\\""']
      
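      A possible workaround on the write side is to set the escape character to a double quote there as well. This is a minimal sketch, assuming the standard quote/escape options of the CSV data source; the output directory name testout_rfc and the use of glob to locate the part file are illustrative.

      df = spark.read.option('escape', '"').csv('testfile.csv')
      df.write.option('escape', '"').csv('testout_rfc')
      
      # the part file should now be readable by Python's csv module
      import csv, glob
      with open(glob.glob('testout_rfc/part-*.csv')[0]) as f:
          for row in csv.reader(f):
              print(row)
      
      # expected output (values round-trip):
      # ['a 2.5" drive', 'another column']
      # ['a "quoted" string', '"quoted"']
      # ['1', '2']
      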

      The culprit is in CSVOptions.scala, where the default escape character is overridden.

      While it is possible to work with CSV files in an RFC-compatible manner by setting these options explicitly, it would be useful if Spark's defaults conformed to the above-mentioned RFC (as well as W3C recommendations). I realise this would be a breaking change, so if accepted it would probably need to start with a deprecation warning before moving to a new default.
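      Until such a default change lands, one way to stay consistent is to keep a single set of RFC-style options and reuse it for both reads and writes; a minimal sketch, assuming the standard quote and escape options (the testout_rfc path is illustrative):

      rfc_opts = {'quote': '"', 'escape': '"'}
      
      df = spark.read.options(**rfc_opts).csv('testfile.csv')
      df.write.options(**rfc_opts).csv('testout_rfc')
      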

            People

              Assignee: Unassigned
              Reporter: Ondrej Kokes (ondrej)
              Votes: 1
              Watchers: 9
