Details

- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 2.2.0
- Fix Version/s: None
- Component/s: None
Description
When reading or writing CSV files with Spark, double quotes are escaped with a backslash by default. However, the appropriate behaviour, as set out by RFC 4180 (and adhered to by many software packages), is to escape a double quote with a second double quote.
This piece of Python code demonstrates the issue:

import csv

with open('testfile.csv', 'w') as f:
    cw = csv.writer(f)
    cw.writerow(['a 2.5" drive', 'another column'])
    cw.writerow(['a "quoted" string', '"quoted"'])
    cw.writerow([1, 2])

with open('testfile.csv') as f:
    print(f.read())
# "a 2.5"" drive",another column
# "a ""quoted"" string","""quoted"""
# 1,2

spark.read.csv('testfile.csv').collect()
# [Row(_c0='"a 2.5"" drive"', _c1='another column'),
#  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
#  Row(_c0='1', _c1='2')]

# explicitly stating the escape character fixed the issue
spark.read.option('escape', '"').csv('testfile.csv').collect()
# [Row(_c0='a 2.5" drive', _c1='another column'),
#  Row(_c0='a "quoted" string', _c1='"quoted"'),
#  Row(_c0='1', _c1='2')]
The same applies to writes: reading back a file that Spark wrote with its default options may produce garbage.
df = spark.read.option('escape', '"').csv('testfile.csv')  # reading the file correctly
df.write.format("csv").save('testout.csv')

with open('testout.csv/part-....csv') as f:
    cr = csv.reader(f)
    print(next(cr))
    print(next(cr))
# ['a 2.5\\ drive"', 'another column']
# ['a \\quoted\\" string"', '\\quoted\\""']
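The mangled result can in fact be reproduced without Spark at all: feeding a backslash-escaped line (the style Spark writes by default) to an RFC 4180 parser such as the csv module's default reader yields the same garbage. A minimal sketch, using an in-memory string in place of a file:

```python
import csv
import io

# A line in the style Spark writes by default: quoted fields with
# backslash-escaped inner quotes.
spark_style_line = '"a \\"quoted\\" string","\\"quoted\\""\r\n'

# The csv module's default reader expects RFC 4180 doubled quotes, so the
# backslashes are treated as literal characters and the quoting breaks down.
row = next(csv.reader(io.StringIO(spark_style_line)))
print(row)  # ['a \\quoted\\" string"', '\\quoted\\""']
```

This matches the output shown above, confirming that the garbage comes from the mismatch between the two escaping conventions rather than from anything specific to the reader.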
The culprit is in CSVOptions.scala, where the default escape character is overridden.
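The two conventions are easy to contrast with Python's standard csv module, whose writer defaults to RFC 4180 doubling but can also be configured for backslash escaping. A sketch using in-memory buffers (note that, unlike Spark, the csv module then leaves the backslash-escaped field unquoted):

```python
import csv
import io

row = ['a 2.5" drive', 'another column']

# RFC 4180 convention: embedded double quotes are doubled (the default).
rfc_buf = io.StringIO()
csv.writer(rfc_buf).writerow(row)
print(rfc_buf.getvalue())  # "a 2.5"" drive",another column

# Spark's default convention: embedded double quotes escaped with a backslash.
bs_buf = io.StringIO()
csv.writer(bs_buf, doublequote=False, escapechar='\\').writerow(row)
print(bs_buf.getvalue())  # a 2.5\" drive,another column
```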
While it is possible to work with CSV files in a "compatible" manner by setting these options explicitly, it would be useful if Spark had sensible defaults that conform to the above-mentioned RFC (as well as W3C recommendations). I realise this would be a breaking change, so if accepted it would probably need to start with a deprecation warning before moving to the new default.
Issue Links

- is duplicated by:
  - SPARK-25251 Make spark-csv's `quote` and `escape` options conform to RFC 4180 (Resolved)
  - SPARK-25086 Incorrect Default Value For "escape" For CSV Files (Resolved)