Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.7
Fix Version/s: None
Component/s: None
Description
In PERMISSIVE mode:
Given a CSV whose rows contain multiple columns, if the file schema declares a single column and the SQL SELECT has a condition like '<corrupt_record_column_name> is null', the rows are marked as corrupted.
BUT if you add an extra column to the file schema and do not reference that column in the SQL SELECT, the same rows are not marked as corrupted (see the sketch below).
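To make the two cases concrete, here is a minimal sketch of what I mean. The file path, class name, and column names are hypothetical stand-ins (the real test is linked at the bottom of this description); assume a local CSV file in which every row carries two values:

{code:java}
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PermissiveModeRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("permissive-mode-repro")
                .getOrCreate();

        // Case 1: the schema declares one data column (plus the corrupt-record
        // column). Each file row carries two values, so the token count does
        // not match the schema; PERMISSIVE mode fills _corrupt_record and the
        // filter "_corrupt_record is null" removes every row.
        StructType oneColumn = new StructType()
                .add("col1", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);
        spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .schema(oneColumn)
                .csv("data/two-values-per-row.csv") // hypothetical path
                .createOrReplaceTempView("one_col");
        spark.sql("SELECT col1 FROM one_col WHERE _corrupt_record is null")
                .show(); // empty: every row is flagged as corrupted

        // Case 2: the schema declares a second column that the query never
        // selects. The declared data schema now matches the file's two values
        // per row, so nothing is flagged, even though col2 is unused.
        StructType twoColumns = new StructType()
                .add("col1", DataTypes.StringType)
                .add("col2", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);
        spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .schema(twoColumns)
                .csv("data/two-values-per-row.csv")
                .createOrReplaceTempView("two_col");
        spark.sql("SELECT col1 FROM two_col WHERE _corrupt_record is null")
                .show(); // the very same rows now survive the filter

        spark.stop();
    }
}
{code}

The surprising part is that both queries select only col1, yet whether a row counts as corrupted depends on a declared column the query never reads.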
PS. I don't know exactly which behavior is correct; I could not find it documented for PERMISSIVE mode.
What I found is this note in the 2.4 migration guide: "As an example, CSV file contains the 'id,name' header and one row '1234'. In Spark 2.4, the selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier, it is empty in the DROPMALFORMED mode. To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false."
https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html
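For reference, this is a minimal sketch of the workaround that note describes, assuming the same local session as in the sketch above; whether disabling pruning should also make the two cases above behave the same is exactly what this issue is asking:

{code:java}
import org.apache.spark.sql.SparkSession;

public class DisableCsvColumnPruning {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("disable-csv-column-pruning")
                .getOrCreate();

        // The flag named in the 2.4 migration guide: when false, the CSV
        // parser materializes every column of the declared schema instead of
        // only the columns the query references, restoring pre-2.4 behavior.
        spark.conf().set("spark.sql.csv.parser.columnPruning.enabled", "false");

        // ... CSV reads performed after this point are parsed without pruning.

        spark.stop();
    }
}
{code}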
I made a "unit" test to demonstrate the issue: https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java