Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Cannot Reproduce
- Affects Version/s: 0.14.1
- Fix Version/s: None
Description
PyArrow uses the footer metadata to determine the format version of a Parquet file, while the parquet-mr library (which Spark uses) determines the version at the page level, from the page header type. Moreover, parquet-mr's ParquetFileWriter hardcodes the version in the footer to '1'. See: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913
Thus, Spark can write and read its own files, and PyArrow can write and read its own files, but when PyArrow tries to read a version 2.0 file written by Spark, it throws an error about a malformed file (because it assumes the format version is 1.0).
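To check which version a given file's footer actually declares, the footer can be inspected directly. The following is a minimal sketch, not PyArrow's or parquet-mr's own code. The function name `read_footer_version` is hypothetical. It assumes an unencrypted file ending in the `PAR1` magic, and it assumes the writer serialized the `version` field first in the Thrift compact-encoded FileMetaData (field id 1, type i32, short-form header byte 0x15).

```python
import struct

MAGIC = b"PAR1"


def read_footer_version(path):
    """Return the `version` value stored in a Parquet file footer.

    Assumption: `version` is the first serialized field of FileMetaData
    and uses the short-form Thrift compact field header (0x15).
    """
    with open(path, "rb") as f:
        # The file ends with a 4-byte little-endian footer length + magic.
        f.seek(-8, 2)
        tail = f.read(8)
        if tail[4:] != MAGIC:
            raise ValueError("missing PAR1 magic at end of file")
        (footer_len,) = struct.unpack("<I", tail[:4])
        f.seek(-(8 + footer_len), 2)
        footer = f.read(footer_len)

    if not footer or footer[0] != 0x15:
        raise ValueError("footer does not start with the expected version field")

    # Decode the unsigned varint that follows the field header.
    value, shift = 0, 0
    for byte in footer[1:]:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    else:
        raise ValueError("truncated varint in footer")

    # Thrift compact i32 values are zigzag-encoded.
    return (value >> 1) ^ -(value & 1)
```

If the description above is accurate, a file written by Spark with version 2.0 data pages would still be expected to report 1 here, since only the page headers carry the v2 marker.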
Depending on the compression method, the error is one of:
- Corrupt snappy compressed data
- GZipCodec failed: incorrect header check
- ArrowIOError: Unknown encoding type
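When triaging read failures, the three messages above can serve as a rough signal that a file may be hitting this mismatch. A small sketch follows. The helper name `may_be_data_page_v2_mismatch` is hypothetical, and the match is only a heuristic, since genuinely corrupt files can produce the same messages.

```python
# Error substrings quoted in this report.
KNOWN_SYMPTOMS = (
    "Corrupt snappy compressed data",
    "GZipCodec failed: incorrect header check",
    "Unknown encoding type",
)


def may_be_data_page_v2_mismatch(error_message):
    """Return True if the message matches one of the symptoms listed above."""
    return any(symptom in error_message for symptom in KNOWN_SYMPTOMS)
```

A typical use would be to catch the exception raised while reading, pass `str(exc)` to this helper, and then check which writer produced the file.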
Issue Links
- is caused by
  - PARQUET-458 [C++] Implement support for DataPageV2 (Resolved)