Spark / SPARK-39901

Reconsider design of ignoreCorruptFiles feature



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None


      I'm filing this ticket as a followup to the discussion at https://github.com/apache/spark/pull/36775#issuecomment-1148136217 regarding the `ignoreCorruptFiles` feature: the current implementation is biased towards considering a broad range of IOExceptions to be corruption, but this is likely overly broad and might misidentify transient errors as corruption (causing non-corrupt data to be erroneously discarded).

      SPARK-39389 fixes one instance of that problem, but we are still vulnerable to similar issues because of the overall design of this feature.
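      A simplified illustration of the underlying problem (not Spark's actual code): because genuinely corrupt data and transient failures both surface as `IOException`, a blanket catch on that supertype cannot tell them apart.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.SocketTimeoutException;

public class BroadCatchSketch {
    public static void main(String[] args) {
        // An exception that plausibly indicates on-disk corruption...
        IOException corruption = new EOFException("truncated file");
        // ...and one that indicates a transient network failure.
        IOException transientError = new SocketTimeoutException("read timed out");

        // Both satisfy the same type check, so a reader that skips a file
        // on any IOException would also skip files on transient errors.
        System.out.println(corruption instanceof IOException);
        System.out.println(transientError instanceof IOException);
    }
}
```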

      I think we should reconsider the design of this feature: maybe we should switch the default behavior so that only an explicit allowlist of known corruption exceptions can cause files to be skipped. This could be done by involving other parts of the code, e.g. rewrapping exceptions into a `CorruptFileException` so that higher layers can positively identify corruption.

      Any changes to behavior here could potentially impact users' jobs, so we'd need to think carefully about when we want to make the change (in a 3.x release? 4.x?) and how we want to provide escape hatches (e.g. configs to revert to the old behavior).


              Assignee: Unassigned
              Reporter: Josh Rosen
              Votes: 0
              Watchers: 5