Spark / SPARK-39901

Reconsider design of ignoreCorruptFiles feature



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None


      I'm filing this ticket as a followup to the discussion at https://github.com/apache/spark/pull/36775#issuecomment-1148136217 regarding the `ignoreCorruptFiles` feature: the current implementation is biased towards considering a broad range of IOExceptions to be corruption, but this is likely overly broad and might misidentify transient errors as corruption (causing non-corrupt data to be erroneously discarded).

      SPARK-39389 fixes one instance of that problem, but we are still vulnerable to similar issues because of the overall design of this feature.
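      A simplified illustration of the underlying problem (not Spark's actual code): because genuinely corrupt data and transient failures both surface as `IOException`, a blanket catch on that supertype cannot tell them apart.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.SocketTimeoutException;

public class BroadCatchSketch {
    public static void main(String[] args) {
        // An exception that plausibly indicates on-disk corruption...
        IOException corruption = new EOFException("truncated file");
        // ...and one that indicates a transient network failure.
        IOException transientError = new SocketTimeoutException("read timed out");

        // Both satisfy the same type check, so a reader that skips a file
        // on any IOException would also skip files on transient errors.
        System.out.println(corruption instanceof IOException);
        System.out.println(transientError instanceof IOException);
    }
}
```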

      I think we should reconsider the design of this feature: maybe we should switch the default behavior so that only an explicit allowlist of known corruption exceptions can cause files to be skipped. This could be done by involving other parts of the code, e.g. rewrapping exceptions into a `CorruptFileException` so that higher layers can positively identify corruption.

      Any changes to behavior here could potentially impact users' jobs, so we'd need to think carefully about when we want to make the change (in a 3.x release? 4.x?) and how we want to provide escape hatches (e.g. configs to revert to the old behavior).


              Assignee: Unassigned
              Reporter: Josh Rosen
              Votes: 0
              Watchers: 5