Spark / SPARK-30650

Parquet files written by Spark often have a corrupted footer and are not readable

    Details

      Description

      This issue is similar to an archived one:

      https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3CJIRA.12767358.1421214067000.78480.1421214094403@Atlassian.JIRA%3E

      Parquet files written by Spark often end up with a corrupted footer and are therefore not readable by Spark.

      The issue occurs more consistently as the granularity of a field increases, i.e. as the redundancy of values in the dataset decreases (more unique values).

      Coalesce doesn't help here either. The job automatically generates a certain number of Parquet files, each with a definite size controlled by Spark internals, but a few of them are written with a corrupted footer. The write job nevertheless ends with a success status.

      Here are a few examples:

      These are the files (267.2 M each) that Spark 1.6.x generated. A few of them have a corrupted footer and are therefore not readable. This happens more frequently when the input file size exceeds a certain limit, and the level of redundancy in the data also matters: for the same file size, the lower the redundancy, the higher the probability of a corrupted footer.

      Hence, in later iterations of the job, when those files need to be read for processing, the read fails with:
      Can not read value 0 in block n in file xxxx
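
      Since the write job reports success even when some files are unreadable, one way to narrow down the affected files is to check the footer framing of each file after the write. Below is a minimal sketch, assuming plain (unencrypted) Parquet files that are accessible on a local filesystem; the function name footer_problem is hypothetical. It relies only on the Parquet layout, in which a file starts with the 4-byte marker PAR1 and ends with a 4-byte little-endian footer length followed by PAR1. It does not parse the footer metadata or the data pages, so a file that passes may still be unreadable.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # Parquet files start and end with this 4-byte marker


def footer_problem(path):
    """Return a description of the footer problem, or None if the framing looks sane.

    Hypothetical helper: it checks only the magic bytes and the footer length,
    not the metadata stored inside the footer.
    """
    size = Path(path).stat().st_size
    # Smallest possible file: leading magic + 4-byte footer length + trailing magic.
    if size < 12:
        return "file too small to hold a Parquet footer"
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            return "missing leading magic bytes"
        f.seek(-8, 2)  # last 8 bytes: footer length (little-endian) + magic
        tail = f.read(8)
    if tail[4:] != MAGIC:
        return "missing trailing magic bytes"
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the leading magic and the 8-byte tail.
    if footer_len == 0 or footer_len > size - 12:
        return f"implausible footer length {footer_len}"
    return None
```

      Running this over every part file in the output directory and listing the paths for which the result is not None gives a first list of suspect files. For output stored on HDFS, the files would have to be copied locally first, or the same check adapted to read the first 4 and last 8 bytes through the filesystem API.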

            People

            • Assignee: Unassigned
            • Reporter: DILIP KUMAR MOHAPATRO (dilipm)
            • Votes: 0
            • Watchers: 2
