Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.7
Fix Version/s: None
Component/s: None
Description
In PERMISSIVE mode:
Given a CSV whose rows contain multiple columns, if the file schema declares a single column and the SQL SELECT has a condition like '<corrupt_record_column_name> is null', the rows are marked as corrupted.
BUT if you add an extra column to the file schema and do not reference that column in the SQL SELECT, the same rows are not marked as corrupted (see the sketch below).
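To make the two cases concrete, here is a minimal sketch of what I mean. The file path, class name, and column names are hypothetical stand-ins (the real test is linked at the bottom of this description); assume a local CSV file in which every row carries two values:

{code:java}
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PermissiveModeRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("permissive-mode-repro")
                .getOrCreate();

        // Case 1: the schema declares one data column (plus the corrupt-record
        // column). Each file row carries two values, so the token count does
        // not match the schema; PERMISSIVE mode fills _corrupt_record and the
        // filter "_corrupt_record is null" removes every row.
        StructType oneColumn = new StructType()
                .add("col1", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);
        spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .schema(oneColumn)
                .csv("data/two-values-per-row.csv") // hypothetical path
                .createOrReplaceTempView("one_col");
        spark.sql("SELECT col1 FROM one_col WHERE _corrupt_record is null")
                .show(); // empty: every row is flagged as corrupted

        // Case 2: the schema declares a second column that the query never
        // selects. The declared data schema now matches the file's two values
        // per row, so nothing is flagged, even though col2 is unused.
        StructType twoColumns = new StructType()
                .add("col1", DataTypes.StringType)
                .add("col2", DataTypes.StringType)
                .add("_corrupt_record", DataTypes.StringType);
        spark.read()
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .schema(twoColumns)
                .csv("data/two-values-per-row.csv")
                .createOrReplaceTempView("two_col");
        spark.sql("SELECT col1 FROM two_col WHERE _corrupt_record is null")
                .show(); // the very same rows now survive the filter

        spark.stop();
    }
}
{code}

The surprising part is that both queries select only col1, yet whether a row counts as corrupted depends on a declared column the query never reads.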
PS. I don't know exactly which behavior is correct; I could not find it documented for PERMISSIVE mode.
What I found is this note in the 2.4 migration guide: "As an example, CSV file contains the 'id,name' header and one row '1234'. In Spark 2.4, the selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier, it is empty in the DROPMALFORMED mode. To restore the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false."
https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html
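For reference, this is a minimal sketch of the workaround that note describes, assuming the same local session as in the sketch above; whether disabling pruning should also make the two cases above behave the same is exactly what this issue is asking:

{code:java}
import org.apache.spark.sql.SparkSession;

public class DisableCsvColumnPruning {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("disable-csv-column-pruning")
                .getOrCreate();

        // The flag named in the 2.4 migration guide: when false, the CSV
        // parser materializes every column of the declared schema instead of
        // only the columns the query references, restoring pre-2.4 behavior.
        spark.conf().set("spark.sql.csv.parser.columnPruning.enabled", "false");

        // ... CSV reads performed after this point are parsed without pruning.

        spark.stop();
    }
}
{code}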
I made a "unit" test to demonstrate the issue: https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java