Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.2
Fix Version/s: None
Component/s: None
Environment: pyspark 3.3.2, local mode
Description
MCVE in PySpark code (though the bug is in Spark itself, not in PySpark):
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("...").getOrCreate()

min_schema = StructType(
    [
        StructField("dummy_col", StringType(), True),
        StructField("record_id", IntegerType(), nullable=False),
        StructField("dummy_after", StringType(), nullable=False),
    ]
)

df = (
    spark.read.option("mode", "FAILFAST")
    .option("quote", '"')
    .option("escape", '"')
    .option("inferSchema", "false")
    .option("multiline", "true")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .schema(min_schema)
    .csv('min_repro.csv', header=True)
)
min_repro.csv:
dummy_col,record_id,dummy_after
"",1,", Unusual value with comma included"
B,2,"Unusual value with escaped quote and comma ""like, this"
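For comparison (my own sanity check, not part of the report), Python's standard csv module, which uses the same RFC 4180 quote-doubling convention, parses this exact file content into the values that df.collect() returns:

```python
# Sanity check with Python's stdlib csv parser: "" inside a quoted field is an
# escaped quote, so both rows parse cleanly and unambiguously.
import csv
import io

data = (
    'dummy_col,record_id,dummy_after\n'
    '"",1,", Unusual value with comma included"\n'
    'B,2,"Unusual value with escaped quote and comma ""like, this"\n'
)

rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['', '1', ', Unusual value with comma included']
print(rows[2])  # ['B', '2', 'Unusual value with escaped quote and comma "like, this']
```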
Spark behaves inconsistently:
1. collect() works fine:
df.collect()
[Row(dummy_col=None, record_id=1, dummy_after=', Unusual value with comma included'),
 Row(dummy_col='B', record_id=2, dummy_after='Unusual value with escaped quote and comma "like, this')]
2. A trivial query on the same DataFrame fails:
if df.count() != df.select('record_id').distinct().count():
    pass
Error:
Py4JJavaError: An error occurred while calling o357.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 17.0 (TID 13, localhost, executor driver):
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST.
...
Caused by: java.lang.NumberFormatException: For input string: "Unusual value with comma included""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
I understand that the input file's quoting is unusual, but Spark still should not parse the same DataFrame differently depending on which method is invoked.
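A possible workaround, under my assumption that the inconsistency is triggered by CSV column pruning (df.select('record_id') makes the parser re-read only that column, and the quote/escape interplay then resolves field boundaries differently than a full-row parse), is to disable pruning via the spark.sql.csv.parser.columnPruning.enabled setting:

```python
# Possible workaround sketch (assumption: the failure comes from CSV column
# pruning). Disabling pruning forces full-row parsing for every query, which
# should match what df.collect() already does.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("...")
    .config("spark.sql.csv.parser.columnPruning.enabled", "false")
    .getOrCreate()
)
```

This trades some scan performance for consistent parsing; it does not fix the underlying bug.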