[SPARK-30650] The parquet file written by spark often incurs corrupted footer and hence not readable - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.6.1
Fix Version/s: None
Component/s: Block Manager, Input/Output, Optimizer, Spark Core, SQL
Labels: None

Description

This issue is similar to an archived one:

https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3CJIRA.12767358.1421214067000.78480.1421214094403@Atlassian.JIRA%3E

Parquet files written by Spark often end up with a corrupted footer and therefore cannot be read back by Spark.
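For context, a Parquet file begins with the 4-byte magic `PAR1` and ends with the footer metadata, a 4-byte little-endian footer length, and the same magic again. The following is a minimal sketch, not part of the original report, of how the framing of a suspect file could be checked on a local copy. The function name `inspect_parquet_footer` is hypothetical, and the check covers only the magic bytes and the footer length; it does not decode the footer metadata or any row group.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def inspect_parquet_footer(path):
    """Return (ok, reason) for the Parquet framing of a local file.

    Checks only the leading magic, the trailing magic, and the footer
    length field; the footer metadata itself is not parsed.
    """
    size = os.path.getsize(path)
    # Smallest framing: 4-byte magic + 4-byte footer length + 4-byte magic.
    if size < 12:
        return False, "file too small to be Parquet"
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(size - 8)
        tail = f.read(8)
    if head != PARQUET_MAGIC:
        return False, "missing leading magic"
    if tail[4:] != PARQUET_MAGIC:
        return False, "missing trailing magic"
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the leading magic and the 8-byte tail.
    if footer_len == 0 or footer_len > size - 12:
        return False, "implausible footer length"
    return True, "framing looks intact"
```

A file that passes this check can still fail to read if the footer metadata or a data page is damaged, so a negative result is more informative than a positive one.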

The issue occurs more consistently as the granularity of a field increases, i.e. as the redundancy of values in the dataset decreases (more unique values).

Coalesce does not help here either. Spark still generates a certain number of Parquet files, each with a definite size controlled by Spark internals, and a few of them are written with a corrupted footer. The write job nevertheless ends with a success status.

Here are a few examples:

These are the files (267.2 M each) that Spark 1.6.x generated. A few of them have a corrupted footer and are therefore not readable. This happens more frequently when the input file size exceeds a certain limit, and the level of redundancy in the data also matters: for the same file size, the lower the redundancy, the higher the probability of a corrupted footer.

Hence, in later iterations of the job, when those files have to be read for processing, the read ends with:
Can not read value 0 in block n in file xxxx
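
Because the write job reports success even when some output files are unreadable, a validation pass between iterations could flag suspect files before the next read. The sketch below is an illustration only, assuming the output directory has been copied to a local filesystem; the helper names `has_parquet_magic` and `split_part_files` are hypothetical. It tests only for the `PAR1` magic at both ends of each file, so it would catch truncated or badly framed files but not every kind of corruption.

```python
import os

PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """True if the file starts and ends with the 4-byte Parquet magic."""
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def split_part_files(output_dir):
    """Partition the data files in output_dir into (readable, suspect)."""
    readable, suspect = [], []
    for name in sorted(os.listdir(output_dir)):
        # Skip marker and hidden files such as _SUCCESS or .crc checksums.
        if name.startswith(("_", ".")):
            continue
        path = os.path.join(output_dir, name)
        if not os.path.isfile(path):
            continue
        (readable if has_parquet_magic(path) else suspect).append(path)
    return readable, suspect
```

The suspect list could then be used to regenerate or exclude those files rather than letting the next iteration fail partway through a read.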

People

Assignee: Unassigned

Reporter: DILIP KUMAR MOHAPATRO

Votes: 0

Watchers: 1

Dates

Created: 27/Jan/20 12:43

Updated: 12/Dec/22 18:10

Resolved: 30/Jan/20 01:28