Apache Arrow / ARROW-6057

[Python] Parquet files v2.0 created by spark can't be read by pyarrow


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.14.1
    • Fix Version/s: None
    • Component/s: C++

    Description

      PyArrow uses the footer metadata to determine the format version of a Parquet file, while the parquet-mr library (which is used by Spark) determines the version at the page level, from the page header type. Moreover, in ParquetFileWriter, parquet-mr hardcodes the version in the footer to '1'. See: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913
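      To make the footer side of this concrete, here is a minimal stdlib-only sketch of where the footer version lives. It is not PyArrow's actual code path. It relies on the Parquet file layout (footer, 4-byte little-endian footer length, trailing `PAR1` magic) and assumes the footer is a Thrift compact-protocol `FileMetaData` struct whose `version` field (field id 1, i32) is serialized first. The function names are hypothetical.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def read_zigzag_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one zigzag-encoded varint at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    # Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    return (result >> 1) ^ -(result & 1), pos


def footer_version(data: bytes) -> int:
    """Return the 'version' field stored in the footer FileMetaData."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    footer = data[-8 - footer_len:-8]
    # Compact field header: high nibble = field id delta, low nibble = type.
    # 0x15 means "field id 1, type i32" (assumed to be the first field).
    if not footer or footer[0] != 0x15:
        raise ValueError("unexpected first field in footer")
    value, _ = read_zigzag_varint(footer, 1)
    return value
```

      For a real file, `footer_version(open(path, "rb").read())` would return the value the writer put in the footer; per the description above, files written by parquet-mr would be expected to report 1 even when their data pages are v2.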

      Thus, Spark can write and read its own files, and pyarrow can write and read its own files, but when pyarrow tries to read a version 2.0 file written by Spark, it throws an error about a malformed file (because it assumes the format version is 1.0).
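      The mismatch can be expressed as a small consistency check. This is only an illustrative sketch under stated assumptions: the enum values are those of `PageType` in parquet.thrift, the helper names are hypothetical, and the inputs (footer version and page types) are assumed to have been decoded already by some other means.

```python
from enum import IntEnum
from typing import Iterable


class PageType(IntEnum):
    # Values of the PageType enum in parquet.thrift.
    DATA_PAGE = 0
    INDEX_PAGE = 1
    DICTIONARY_PAGE = 2
    DATA_PAGE_V2 = 3


def uses_v2_pages(page_types: Iterable[int]) -> bool:
    """True if any page header declares the v2 data page type."""
    return any(t == PageType.DATA_PAGE_V2 for t in page_types)


def footer_disagrees_with_pages(footer_version: int,
                                page_types: Iterable[int]) -> bool:
    """True when the footer claims version 1 but the pages are v2.

    This is the situation described in the report: a reader that trusts
    the footer would treat v2 pages as if they were v1 pages.
    """
    return footer_version == 1 and uses_v2_pages(page_types)
```

      A file with a dictionary page followed by v2 data pages and a footer version of 1 would be flagged by `footer_disagrees_with_pages(1, [2, 3, 3])`.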

      Depending on the compression method, the error is one of:

      Corrupt snappy compressed data

      GZipCodec failed: incorrect header check

      ArrowIOError: Unknown encoding type


            People

              Assignee: Unassigned
              Reporter: Vladyslav Shamaida (gs_vlad)
              Votes: 3
              Watchers: 6
