[SPARK-34422] CSV(/JSON?) files with corrupt row + Permissive mode can yield wrong partial result row - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.4.7, 3.0.1, 3.1.1
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

(This was actually found and fixed in spark-xml, which copied some Spark code for handling bad records. See https://github.com/databricks/spark-xml/issues/517 )

When CSV parsing (and, I think, JSON parsing) encounters a bad record in Permissive mode, it can return a partial result containing the values that were successfully parsed, along with the problem input in a new 'corrupt record' column.
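To make the behavior concrete, here is a simplified, pure-Python model of permissive parsing. It is not Spark's implementation: `parse_permissive` and its `fields` parameter are hypothetical names invented for this illustration, and the column name `_corrupt_record` is assumed to be the configured 'corrupt record' column.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Assumed name of the 'corrupt record' column for this illustration.
CORRUPT_COLUMN = "_corrupt_record"


def parse_permissive(
    line: str, fields: List[Tuple[str, Callable[[str], Any]]]
) -> Dict[str, Optional[Any]]:
    """Toy model of permissive parsing (hypothetical helper, not a Spark API).

    Each field is a (name, converter) pair. Values that convert cleanly are
    kept as the partial result; if anything goes wrong, the raw input line is
    stored in the corrupt-record column instead of raising.
    """
    tokens = line.split(",")
    row: Dict[str, Optional[Any]] = {name: None for name, _ in fields}
    row[CORRUPT_COLUMN] = None

    # A wrong number of tokens already marks the record as bad.
    bad = len(tokens) != len(fields)
    for (name, convert), token in zip(fields, tokens):
        try:
            row[name] = convert(token)
        except ValueError:
            bad = True  # keep going so other fields can still be parsed

    if bad:
        row[CORRUPT_COLUMN] = line
    return row


schema = [("id", int), ("qty", int)]
good = parse_permissive("1,2", schema)
partial = parse_permissive("1,oops", schema)
```

In this model, `partial` keeps the successfully parsed `id`, leaves `qty` empty, and carries the raw line in the corrupt-record column.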

However, the logic in FailureSafeParser that copies the partial results into the resulting Row has an off-by-one error that arises when the Catalyst projection puts the 'corrupt record' column anywhere but last, which can readily happen. As a result, the partial results may be wrong, or processing the bad record in permissive mode may fail entirely if the copied elements don't happen to match the schema of the result.
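The kind of index mismatch described above can be sketched in a few lines of Python. This is an illustration of the general failure pattern only, not the actual FailureSafeParser code; the function names, the list-based rows, and the column name `_corrupt_record` are all assumptions made for the example.

```python
from typing import Any, List, Optional

CORRUPT = "_corrupt_record"  # assumed corrupt-record column name


def to_result_row_buggy(
    schema: List[str], parsed: List[Any], bad_record: str
) -> List[Optional[Any]]:
    """Copies parsed[i] into result[i], ignoring the slot taken by CORRUPT."""
    actual = [name for name in schema if name != CORRUPT]
    result: List[Optional[Any]] = [None] * len(schema)
    for i in range(len(actual)):
        result[i] = parsed[i]  # wrong once CORRUPT sits before column i
    result[schema.index(CORRUPT)] = bad_record
    return result


def to_result_row_fixed(
    schema: List[str], parsed: List[Any], bad_record: str
) -> List[Optional[Any]]:
    """Looks up each field's position in the full schema before copying."""
    actual = [name for name in schema if name != CORRUPT]
    result: List[Optional[Any]] = [None] * len(schema)
    for i, name in enumerate(actual):
        result[schema.index(name)] = parsed[i]
    result[schema.index(CORRUPT)] = bad_record
    return result


# Corrupt column last: both versions agree.
last = ["a", "b", CORRUPT]
# Corrupt column first: the buggy version shifts values into the wrong slots.
first = [CORRUPT, "a", "b"]

buggy = to_result_row_buggy(first, [1, "x"], "raw line")
fixed = to_result_row_fixed(first, [1, "x"], "raw line")
```

With the corrupt column first, the buggy copy places the string meant for `b` into the slot for `a` and leaves `b` empty. If `a` were declared as an integer column, a downstream consumer expecting that type could fail, which matches the "fails entirely" symptom reported here.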

The partial results are usually not that useful, so their being wrong isn't a huge deal, but failing entirely in permissive mode is a problem.

People

Assignee: Sean R. Owen

Reporter: Sean R. Owen

Votes: 0

Watchers: 0

Dates

Created: 11/Feb/21 13:37

Updated: 11/Feb/21 16:14

Resolved: 11/Feb/21 16:14