Spark / SPARK-19082

The config ignoreCorruptFiles doesn't work for Parquet


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.1, 2.2.0
    • Component/s: SQL
    • Labels: None

    Description

      The config spark.sql.files.ignoreCorruptFiles is meant to let SQL skip corrupt files when reading files. Currently the ignoreCorruptFiles config has two issues that keep it from working for Parquet:

      1. We only ignore corrupt files in FileScanRDD. However, we begin to read those files earlier, when inferring the data schema from them. For a corrupt file the schema cannot be read, so the program fails. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html

      2. In FileScanRDD, we assume that the files are only read once we start consuming the iterator. However, it is possible that the files are read before that. In this case, the ignoreCorruptFiles config doesn't work either.
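
      The two issues above can be illustrated with a toy model in plain Python. This is only a sketch of the failure pattern, not Spark code: the in-memory FILES table, CorruptFileError, and all function names are hypothetical stand-ins, and the guarded variants show one possible shape of a fix rather than the change that was actually merged.

```python
class CorruptFileError(Exception):
    """Hypothetical stand-in for the error raised on an unreadable file."""


# Hypothetical in-memory "files": a schema string followed by rows.
# None marks a corrupt file.
FILES = {
    "a.parquet": ["id:int", 1, 2],
    "bad.parquet": None,
    "c.parquet": ["id:int", 3],
}


def read_footer(name):
    """Return the schema of a file, failing on a corrupt one."""
    content = FILES[name]
    if content is None:
        raise CorruptFileError(name)
    return content[0]


def open_rows(name):
    """Eager reader: touches the file on open, before any row is consumed."""
    content = FILES[name]
    if content is None:
        raise CorruptFileError(name)
    return iter(content[1:])


def infer_schema(names, ignore_corrupt):
    """Issue 1: schema inference reads every file before the scan starts.

    With ignore_corrupt=False this behaves like an unguarded inference path.
    """
    schemas = set()
    for name in names:
        try:
            schemas.add(read_footer(name))
        except CorruptFileError:
            if not ignore_corrupt:
                raise
    return schemas


def scan_guard_consumption_only(names, ignore_corrupt):
    """Issue 2: only row consumption is guarded, opening the file is not."""
    for name in names:
        rows = open_rows(name)  # unguarded: a corrupt file raises here
        while True:
            try:
                row = next(rows)
            except StopIteration:
                break
            except CorruptFileError:
                if not ignore_corrupt:
                    raise
                break
            yield row


def scan_guard_open_and_consume(names, ignore_corrupt):
    """Guard both opening and consuming, so eager readers are covered too."""
    for name in names:
        try:
            yield from open_rows(name)
        except CorruptFileError:
            if not ignore_corrupt:
                raise
```

      In this model, scan_guard_consumption_only still raises on bad.parquet even with ignore_corrupt=True, because the error occurs while opening the file, whereas scan_guard_open_and_consume skips the corrupt file and yields the rows of the remaining ones.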

            People

              Assignee: viirya L. C. Hsieh
              Reporter: viirya L. C. Hsieh
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: