[SPARK-28058] Reading csv with DROPMALFORMED sometimes doesn't drop malformed records - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.1, 2.4.3
Fix Version/s: 2.4.4, 3.0.0
Component/s: SQL
Labels: None

Description

The Spark SQL CSV reader is not dropping malformed records as expected.

Consider this file (fruit.csv). It contains a header record, three valid records, and one malformed record (the last line, which has a single field instead of four).

fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx

If I read this file using the Spark SQL CSV reader as follows, everything looks good: the malformed record is dropped.

scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
+------+------+-----+--------+
|fruit |color |price|quantity|
+------+------+-----+--------+
|apple |red   |1    |3       |
|banana|yellow|2    |4       |
|orange|orange|3    |5       |
+------+------+-----+--------+

However, if I select a subset of the columns, the malformed record is not dropped. Its data is placed in the first column, and any remaining selected columns are filled with nulls.

scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
+------+
|fruit |
+------+
|apple |
|banana|
|orange|
|xxx   |
+------+

scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+------+------+
|fruit |color |
+------+------+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+------+------+

scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
+------+------+-----+
|fruit |color |price|
+------+------+-----+
|apple |red   |1    |
|banana|yellow|2    |
|orange|orange|3    |
|xxx   |null  |null |
+------+------+-----+

Finally, if I explicitly select all of the columns, the malformed record is once again dropped.

scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
+------+------+-----+--------+
|fruit |color |price|quantity|
+------+------+-----+--------+
|apple |red   |1    |3       |
|banana|yellow|2    |4       |
|orange|orange|3    |5       |
+------+------+-----+--------+

I would expect the malformed record(s) to be dropped regardless of which columns are being selected from the file.
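
A possible workaround on affected versions, offered as a sketch rather than as the fix applied for this issue: the title of the linked SPARK-28082 suggests that the behavior comes from CSV column pruning, where the parser only tokenizes the columns a query needs. Assuming that is the cause, turning off the Spark SQL setting spark.sql.csv.parser.columnPruning.enabled should make the parser see every field of each record, so the short record is detected even when only some columns are selected. The temporary file, the application name, and the variable names below are illustrative and not part of the original report. The snippet is written as a script for spark-shell.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("dropmalformed-workaround-sketch") // hypothetical name
  .getOrCreate()

// Recreate fruit.csv from this report in a temporary file.
val csvPath = Files.createTempFile("fruit", ".csv")
val csvText =
  """fruit,color,price,quantity
    |apple,red,1,3
    |banana,yellow,2,4
    |orange,orange,3,5
    |xxx
    |""".stripMargin
Files.write(csvPath, csvText.getBytes(StandardCharsets.UTF_8))

// Assumption: column pruning is what hides the malformed record.
// Ask the CSV parser to tokenize every column, not only the selected ones.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

// Same read as in the report, selecting a single column.
val fruitNames = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv(csvPath.toString)
  .select("fruit")
  .collect()
  .map(_.getString(0))
  .toSeq

fruitNames.foreach(println)
```

The trade-off is that the parser does more work per record, since it no longer skips unselected columns, so this may be slower on wide files.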

Issue Links

is duplicated by

SPARK-28079 CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema (Resolved)

SPARK-28082 Add a note to DROPMALFORMED mode of CSV for column pruning (Resolved)

links to

GitHub Pull Request #24894

People

Assignee: L. C. Hsieh

Reporter: Stuart White

Votes: 1

Watchers: 2

Dates

Created: 14/Jun/19 20:48

Updated: 12/Dec/22 18:10

Resolved: 18/Jun/19 04:49