Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-39901

Reconsider design of ignoreCorruptFiles feature

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.0.0
    • None
    • SQL

    Description

      I'm filing this ticket as a followup to the discussion at https://github.com/apache/spark/pull/36775#issuecomment-1148136217 regarding the `ignoreCorruptFiles` feature: the current implementation is based towards considering a broad range of IOExceptions to be corruption, but this is likely overly-broad and might mis-identify transient errors as corruption (causing non-corrupt data to be erroneously discarded).

      SPARK-39389 fixes one instance of that problem, but we are still vulnerable to similar issues because of the overall design of this feature.

      I think we should reconsider the design of this feature: maybe we should switch the default behavior so that only an explicit allowlist of known corruption exceptions can cause files to be skipped. This could be done through involvement of other parts of the code, e.g. rewrapping exceptions into a `CorruptFileException` so higher layers can positively identify corruption.

      Any changes to behavior here could potentially impact users jobs, so we'd need to think carefully about when we want to change (in a 3.x release? 4.x?) and how we want to provide escape hatches (e.g. configs to revert back to old behavior). 

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              joshrosen Josh Rosen
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: