[ARROW-6057] [Python] Parquet files v2.0 created by spark can't be read by pyarrow - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 0.14.1
Fix Version/s: None
Component/s: C++
Labels:
- parquet

External issue URL:
https://github.com/apache/arrow/issues/22459

Description

PyArrow uses the footer metadata to determine the format version of a Parquet file, whereas the parquet-mr library (which Spark uses) determines the version at the page level, from the page header type. Moreover, parquet-mr's ParquetFileWriter hardcodes the version in the footer to '1'. See: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913

Thus Spark can write and read its own files, and PyArrow can write and read its own files, but when PyArrow tries to read a version 2.0 file written by Spark, it throws an error about a malformed file (because it assumes the format version is 1.0).

Depending on the compression method, the error is one of:

- Corrupt snappy compressed data

- GZipCodec failed: incorrect header check

- ArrowIOError: Unknown encoding type
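
The mismatch described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not PyArrow or parquet-mr code: the `Page` class and all function names are hypothetical, zlib stands in for the real codecs, and the level bytes are placeholders. It relies on one property of the Parquet format: in a v1 data page the levels and values are compressed together, while in a DataPageV2 the levels are stored uncompressed ahead of the compressed values. A reader that picks the v1 layout because the footer says '1' therefore hands uncompressed level bytes to the decompressor.

```python
import zlib
from dataclasses import dataclass

# Toy model: real Parquet pages use Thrift headers and RLE-encoded levels.
DATA_PAGE = "DATA_PAGE"        # v1 layout: levels + values compressed together
DATA_PAGE_V2 = "DATA_PAGE_V2"  # v2 layout: levels uncompressed, values compressed


@dataclass
class Page:
    page_type: str   # hypothetical stand-in for the page header type
    body: bytes
    levels_len: int  # byte length of the level section


def write_v2_page(levels: bytes, values: bytes) -> Page:
    # Levels are written as-is; only the values are compressed.
    return Page(DATA_PAGE_V2, levels + zlib.compress(values), len(levels))


def read_as_v1(page: Page) -> tuple:
    # v1 layout: decompress the whole body, then split levels from values.
    raw = zlib.decompress(page.body)
    return raw[:page.levels_len], raw[page.levels_len:]


def read_as_v2(page: Page) -> tuple:
    # v2 layout: levels are read directly, only the rest is decompressed.
    levels = page.body[:page.levels_len]
    return levels, zlib.decompress(page.body[page.levels_len:])


def read_trusting_footer(footer_version: int, page: Page) -> tuple:
    # Chooses the layout from the footer version alone.
    return read_as_v1(page) if footer_version == 1 else read_as_v2(page)


def read_by_page_type(page: Page) -> tuple:
    # Chooses the layout from the page header type.
    return read_as_v2(page) if page.page_type == DATA_PAGE_V2 else read_as_v1(page)


if __name__ == "__main__":
    page = write_v2_page(b"\x01\x01\x01\x01", b"hello parquet")
    footer_version = 1  # the footer claims v1 even though the page is v2

    print(read_by_page_type(page))
    try:
        read_trusting_footer(footer_version, page)
    except zlib.error as exc:
        print("footer-based read failed:", exc)
```

In this model the page-type reader returns the original levels and values, while the footer-based reader fails inside the decompressor, which is the same family of failure as the codec errors listed above.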

Issue Links

is caused by: PARQUET-458 [C++] Implement support for DataPageV2 (Resolved)

People

Assignee: Unassigned
Reporter: Vladyslav Shamaida
Votes: 3
Watchers: 6

Dates

Created: 29/Jul/19 08:52
Updated: 11/Jan/23 07:44
Resolved: 20/Feb/21 03:55