Description
The Spark SQL CSV reader does not drop malformed records in DROPMALFORMED mode when only a subset of the columns is selected.
Consider this file (fruit.csv). Notice it contains a header record, 3 valid records, and one malformed record.
fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
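For reproduction, the file can be created from the same spark-shell session. This is a minimal sketch using java.nio; it assumes the relative path fruit.csv resolves to a local working directory that Spark also reads from (on a cluster with a non-local default filesystem the file would have to be uploaded instead).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Header record, 3 valid records, and one malformed record ("xxx" has a single field).
val lines = Seq(
  "fruit,color,price,quantity",
  "apple,red,1,3",
  "banana,yellow,2,4",
  "orange,orange,3,5",
  "xxx"
)

// Assumption: the local working directory is where Spark resolves "fruit.csv".
Files.write(Paths.get("fruit.csv"), lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
```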
If I read this file using the spark sql csv reader as follows, everything looks good. The malformed record is dropped.
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
+------+------+-----+--------+
|fruit |color |price|quantity|
+------+------+-----+--------+
|apple |red   |1    |3       |
|banana|yellow|2    |4       |
|orange|orange|3    |5       |
+------+------+-----+--------+
However, if I select a subset of the columns, the malformed record is not dropped. The malformed data is placed in the first column, and the remaining column(s) are filled with nulls.
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
+------+
|fruit |
+------+
|apple |
|banana|
|orange|
|xxx   |
+------+

scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+------+------+
|fruit |color |
+------+------+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+------+------+

scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
+------+------+-----+
|fruit |color |price|
+------+------+-----+
|apple |red   |1    |
|banana|yellow|2    |
|orange|orange|3    |
|xxx   |null  |null |
+------+------+-----+
And finally, if I manually select all of the columns, the malformed record is once again dropped.
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
+------+------+-----+--------+
|fruit |color |price|quantity|
+------+------+-----+--------+
|apple |red   |1    |3       |
|banana|yellow|2    |4       |
|orange|orange|3    |5       |
+------+------+-----+--------+
I would expect the malformed record(s) to be dropped regardless of which columns are being selected from the file.
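A possible workaround, offered as a sketch rather than a confirmed fix: the behavior looks tied to CSV column pruning, where the parser only materializes the columns the query asks for. Assuming Spark 2.4 or later (the version in use is not stated here), the SQL configuration spark.sql.csv.parser.columnPruning.enabled controls this and can be turned off for the session so that the whole row is parsed before DROPMALFORMED is applied.

```scala
// Assumption: Spark 2.4+, where this setting exists and defaults to true.
// Disabling it makes the CSV parser read every column of each row,
// even when the query selects only some of them.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

// Same read as in the report, selecting a single column.
spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("fruit.csv")
  .select("fruit")
  .show(truncate = false)
```

With pruning disabled, the xxx row should be treated as malformed regardless of the selected columns, at the cost of parsing columns the query does not use.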
Issue Links
- is duplicated by
- SPARK-28079 CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema (Resolved)
- SPARK-28082 Add a note to DROPMALFORMED mode of CSV for column pruning (Resolved)