Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.1
Fix Version/s: None
Environment: PySpark on a single-node Databricks-managed Spark 3.1.2 cluster.
Description
Hi all,
When reading CSV files with Spark, I ran into a parsing bug.
The summary:
When
1. reading a comma-separated, double-quote-quoted CSV file using the CSV reader options escape='"' and header=True,
2. with a row containing a quoted empty field,
3. followed by a quoted field starting with a comma and followed by one or more characters,
selecting columns from the dataframe at or after the field described in 3) gives incorrect and inconsistent results.
In detail:
When I instruct Spark to read this CSV file:
col1,col2
"",",a"
using the CSV reader options escape='"' (unnecessary for the example, necessary for the files I'm processing) and header=True, I expect the following result:
spark.read.csv(path, escape='"', header=True).show()
+----+----+
|col1|col2|
+----+----+
|null|  ,a|
+----+----+
Spark does yield this result, so far so good. However, when I select col2 from the dataframe, Spark yields an incorrect result:
spark.read.csv(path, escape='"', header=True).select('col2').show()
+----+
|col2|
+----+
|  a"|
+----+
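For completeness, here is a self-contained reproduction sketch; the temp-file handling and SparkSession setup are just scaffolding for a local run, not part of the Databricks environment described above:

# Minimal reproduction; the file path and SparkSession creation are
# assumptions for a local run, not taken from the Databricks cluster.
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = os.path.join(tempfile.mkdtemp(), "repro.csv")
with open(path, "w") as f:
    f.write('col1,col2\n"",",a"\n')

df = spark.read.csv(path, escape='"', header=True)
df.show()                 # full read: col2 comes back as ",a", as expected
df.select('col2').show()  # projected read: col2 comes back as 'a"' instead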
If you run this example with more columns in the file and more commas in the field, e.g. ",,,,,,,a", the problem compounds: Spark shifts many values to the right, causing unexpected and incorrect results. The inconsistency surprised me, as it implies the parsing is evaluated differently between the two calls.
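The compounded case can be sketched the same way, reusing the SparkSession and imports from the snippet above; the extra column and its values are only illustrative, not taken from my actual files:

# Illustrative wider file; the column names and the "b" value are made up.
path2 = os.path.join(tempfile.mkdtemp(), "repro_wide.csv")
with open(path2, "w") as f:
    f.write('col1,col2,col3\n"",",,,,,,,a","b"\n')

df2 = spark.read.csv(path2, escape='"', header=True)
df2.show()                          # full read parses the row as expected
df2.select('col2', 'col3').show()   # projected read shifts values to the right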
I expect the bug to be located in the quote-balancing and un-escaping logic of the CSV parser, but I can't find where that code lives in the code base. I'd be happy to take a look if anyone can point me to it.