[SPARK-28079] CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.3.2, 2.4.3
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

When reading a CSV with mode = "PERMISSIVE", corrupt records are read in without being flagged as corrupt. The only way to get them flagged is to set "columnNameOfCorruptRecord" AND supply an explicit schema that includes this column. Example:

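import java.io.File
import java.nio.charset.StandardCharsets
import org.apache.commons.io.FileUtils

// Second data row has a 4th value that is not declared in the header/schema
val csvText = """
                 | FieldA, FieldB, FieldC
                 | a1,b1,c1
                 | a2,b2,c2,d*""".stripMargin

val csvFile = new File("/tmp/file.csv")
// Pass the charset explicitly; the two-argument FileUtils.write overload is deprecated
FileUtils.write(csvFile, csvText, StandardCharsets.UTF_8)

// Explicit schema that includes the corrupt-record column
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
  .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")

reader.load(csvFile.getAbsolutePath).show(truncate = false)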

This produces the correct result:

+------------+------+------+------+
|corrupt     |fieldA|fieldB|fieldC|
+------------+------+------+------+
|null        | a1   |b1    |c1    |
| a2,b2,c2,d*| a2   |b2    |c2    |
+------------+------+------+------+

However, removing the "schema" call and running:

val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")

reader.load(csvFile.getAbsolutePath).show(truncate = false)

Yields:

+-------+-------+-------+
| FieldA| FieldB| FieldC|
+-------+-------+-------+
| a1    |b1     |c1     |
| a2    |b2     |c2     |
+-------+-------+-------+

The fourth value "d*" in the second row has been silently dropped, and the row has not been marked as corrupt.
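Until this is addressed, one possible workaround is to let Spark derive the column names in a first pass, append the corrupt-record column to that schema, and read the file again with the explicit schema. The following is only a sketch of that idea, not a confirmed fix: the object name CorruptRecordWorkaround and the helpers withCorruptColumn and readFlaggingCorrupt are hypothetical, and it assumes a SparkSession is available and that the behaviour matches the explicit-schema case shown above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical helper object; none of these names come from Spark itself.
object CorruptRecordWorkaround {

  // Append a nullable string column that will hold the raw text of
  // malformed rows, unless the schema already has a column of that name.
  def withCorruptColumn(schema: StructType, name: String): StructType =
    if (schema.fieldNames.contains(name)) schema
    else schema.add(name, StringType, nullable = true)

  // `spark` and `path` are supplied by the caller (assumptions of this sketch).
  def readFlaggingCorrupt(
      spark: SparkSession,
      path: String,
      corruptCol: String = "corrupt"): DataFrame = {
    // Pass 1: take the column names from the header. Without the
    // "inferSchema" option every column is read as a string.
    val inferred = spark.read
      .format("csv")
      .option("header", "true")
      .load(path)
      .schema

    // Pass 2: re-read with an explicit schema that includes the
    // corrupt-record column, as in the working example above.
    spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", corruptCol)
      .schema(withCorruptColumn(inferred, corruptCol))
      .load(path)
  }
}
```

With the file from the example, CorruptRecordWorkaround.readFlaggingCorrupt(spark, "/tmp/file.csv").show(truncate = false) would be the equivalent call. The cost is an extra look at the file to obtain the header, and the corrupt column ends up last rather than first.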

Issue Links

duplicates

SPARK-28058 Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

Resolved

People

Assignee: Unassigned
Reporter: F Jimenez
Votes: 0
Watchers: 3

Dates

Created: 17/Jun/19 09:56
Updated: 12/Dec/22 18:10
Resolved: 20/Jun/19 03:53