Spark / SPARK-39842

Inconsistent parsing of unusually encoded CSV


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.2
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: pyspark 3.3.2, local mode

    Description

      MCVE with PySpark code (the issue is presumably in Spark itself, not PySpark):

      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      spark = SparkSession.builder.appName("...").getOrCreate()
      
      min_schema = StructType(
          [
              StructField("dummy_col", StringType(), True),
              StructField("record_id", IntegerType(), nullable=False),
              StructField("dummy_after", StringType(), nullable=False),
          ]
      )
      
      
      df = (
          spark.read.option("mode", "FAILFAST")
          .option("quote", '"')
          .option("escape", '"')
          .option("inferSchema", "false")
          .option("multiline", "true")
          .option("ignoreLeadingWhiteSpace", "true")
          .option("ignoreTrailingWhiteSpace", "true")
          .schema(min_schema)
          .csv("min_repro.csv", header=True)
      ) 

      min_repro.csv:

      dummy_col,record_id,dummy_after
      "",1,", Unusual value with comma included"
      B,2,"Unusual value with escaped quote and comma ""like, this" 

      Spark behaves inconsistently between these two operations:

       

      1. collect() works fine:

      df.collect()
      
      [Row(dummy_col=None, record_id=1, dummy_after=', Unusual value with comma included'),
      Row(dummy_col='B', record_id=2, dummy_after='Unusual value with escaped quote and comma "like, this')] 
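      For reference, a parser using the standard doubled-quote CSV dialect agrees with the collect() output above. A minimal sketch with Python's stdlib csv module (not Spark; the inline string duplicates min_repro.csv):

```python
import csv
import io

# Same content as min_repro.csv: the quote char is '"' and an embedded
# quote is escaped by doubling it ("" -> "), i.e. the standard dialect
# that Spark's quote='"' + escape='"' options describe.
data = (
    'dummy_col,record_id,dummy_after\n'
    '"",1,", Unusual value with comma included"\n'
    'B,2,"Unusual value with escaped quote and comma ""like, this"\n'
)

rows = list(csv.reader(io.StringIO(data)))
# rows[0] is the header; rows[1] and rows[2] are the two data records.
```

      Both data rows come back exactly as collect() returns them, which suggests the collect() path is the correct parse.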

      2. A trivial query on the same DataFrame fails:

      if df.count() != df.select('record_id').distinct().count():
          pass 

      Error:

      Py4JJavaError: An error occurred while calling o357.count.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 13, localhost, executor driver): org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST.
      ...
      Caused by: java.lang.NumberFormatException: For input string: "Unusual value with comma included""
          at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) 

       

      I understand that the input file's encoding is unusual, but Spark still shouldn't parse the same file differently depending on which DataFrame method is used.
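      One detail worth noting: the failure only appears once a column subset is selected (select('record_id')), so a hypothesis worth testing is Spark's CSV column pruning, which parses rows against a pruned schema. This is an assumption about the cause, not a confirmed diagnosis; the config below exists in Spark, but its effect on this bug is untested here:

```python
# Assumption / untested workaround sketch: with column pruning disabled,
# select('record_id').count() should parse the full row the same way
# collect() does, which may avoid the NumberFormatException above.
spark = (
    SparkSession.builder
    .appName("...")
    .config("spark.sql.csv.parser.columnPruning.enabled", "false")
    .getOrCreate()
)
```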


          People

            Assignee: Unassigned
            Reporter: rogalski (Łukasz Rogalski)
            Votes: 0
            Watchers: 1
