Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.1
Fix Version/s: None
Environment: PySpark on a single-node Databricks-managed Spark 3.1.2 cluster.
Description
Hi all,
When reading CSV files with Spark, I ran into a parsing bug.
The summary:
When
1. reading a comma-separated, double-quote-quoted CSV file using the CSV reader options escape='"' and header=True,
2. with a row containing a quoted empty field,
3. followed by a quoted field starting with a comma and followed by one or more characters,
selecting columns from the dataframe at or after the field described in 3) gives incorrect and inconsistent results.
In detail:
When I instruct Spark to read this CSV file:
col1,col2
"",",a"
using the CSV reader options escape='"' (unnecessary for the example, necessary for the files I'm processing) and header=True, I expect the following result:
spark.read.csv(path, escape='"', header=True).show()
+----+----+
|col1|col2|
+----+----+
|null|  ,a|
+----+----+
Spark does yield this result, so far so good. However, when I select col2 from the dataframe, Spark yields an incorrect result:
spark.read.csv(path, escape='"', header=True).select('col2').show()
+----+
|col2|
+----+
|  a"|
+----+
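For completeness, here is a self-contained reproduction sketch; the temp-file handling and SparkSession setup are just scaffolding for a local run, not part of the Databricks environment described above:

# Minimal reproduction; the file path and SparkSession creation are
# assumptions for a local run, not taken from the Databricks cluster.
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = os.path.join(tempfile.mkdtemp(), "repro.csv")
with open(path, "w") as f:
    f.write('col1,col2\n"",",a"\n')

df = spark.read.csv(path, escape='"', header=True)
df.show()                 # full read: col2 comes back as ",a", as expected
df.select('col2').show()  # projected read: col2 comes back as 'a"' instead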
If you run this example with more columns in the file and more commas in the field, e.g. ",,,,,,,a", the problem compounds: Spark shifts many values to the right, causing unexpected and incorrect results. The inconsistency surprised me, as it implies the parsing is evaluated differently between the two calls.
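The compounded case can be sketched the same way, reusing the SparkSession and imports from the snippet above; the extra column and its values are only illustrative, not taken from my actual files:

# Illustrative wider file; the column names and the "b" value are made up.
path2 = os.path.join(tempfile.mkdtemp(), "repro_wide.csv")
with open(path2, "w") as f:
    f.write('col1,col2,col3\n"",",,,,,,,a","b"\n')

df2 = spark.read.csv(path2, escape='"', header=True)
df2.show()                          # full read parses the row as expected
df2.select('col2', 'col3').show()   # projected read shifts values to the right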
I expect the bug to be located in the quote-balancing and un-escaping logic of the CSV parser, but I can't find where that code lives in the code base. I'd be happy to take a look if anyone can point me to it.