Apache Arrow / ARROW-6057

[Python] Parquet files v2.0 created by Spark can't be read by PyArrow


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.14.1
    • Fix Version/s: None
    • Component/s: C++

      Description

      PyArrow uses the footer metadata to determine the format version of a Parquet file, while the parquet-mr library (which Spark uses) determines the version at the page level, from the page header type. Moreover, parquet-mr's ParquetFileWriter hardcodes the version in the footer to '1'. See: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913
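For context on why the two readers can disagree: a Parquet file ends with a Thrift-serialized FileMetaData blob, a 4-byte little-endian length for that blob, and the `PAR1` magic. The format version PyArrow trusts lives inside that footer blob, while each page carries its own header type (DATA_PAGE vs. DATA_PAGE_V2), which is what parquet-mr keys on. A minimal stdlib sketch of the trailer layout (the `footer_length` helper and the toy byte string are illustrative, not Arrow code):

```python
import struct

MAGIC = b"PAR1"

def footer_length(tail: bytes) -> int:
    """Return the serialized FileMetaData length from the last 8 bytes
    of a Parquet file.

    Trailer layout: <Thrift FileMetaData><4-byte LE length><"PAR1">.
    The file-level format version is a field inside that FileMetaData
    blob; page headers carry their own type independently.
    """
    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    (length,) = struct.unpack("<I", tail[-8:-4])
    return length

# Toy tail: 10 bytes standing in for the Thrift metadata, then length, then magic.
fake_tail = b"\x00" * 10 + struct.pack("<I", 10) + MAGIC
print(footer_length(fake_tail))  # 10
```

Since the footer's version field and the actual page encodings are written independently, nothing forces them to agree, which is exactly the mismatch described above.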

      Thus, Spark can read and write its own files, and PyArrow can read and write its own files, but when PyArrow tries to read a version 2.0 file written by Spark, it throws an error about a malformed file (because the footer tells it the format version is 1.0).

      Depending on the compression codec, the error is one of:

      • Corrupt snappy compressed data
      • GZipCodec failed: incorrect header check
      • ArrowIOError: Unknown encoding type


      People

      • Assignee: Unassigned
      • Reporter: Vladyslav Shamaida (gs_vlad)
      • Votes: 3
      • Watchers: 5
