
ARROW-2369: Large (>~20 GB) files written to Parquet via PyArrow are corrupted


    Description

      When writing large Parquet files (above 10 GB or so) from Pandas via the command

      pq.write_table(my_df, 'table.parquet')

      The write succeeds, but when the Parquet file is loaded, the error message

      ArrowIOError: Invalid parquet file. Corrupt footer.

      appears. The same error occurs when the Parquet file is written chunkwise. When the Parquet files are small, say < 5 GB or so (drawn randomly from the same dataset), everything proceeds normally. I've also tried this with Pandas df.to_parquet(), with the same results.
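
      For reference, here is a minimal sketch of the two write paths described above. The DataFrame contents, file names, and chunk size are placeholders, not the actual data that triggers the bug:

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Placeholder DataFrame; in practice it would need to be tens of GB
      # on disk to reproduce the corrupt-footer error described above.
      my_df = pd.DataFrame({'x': np.random.randn(1000)})
      table = pa.Table.from_pandas(my_df)

      # Single-shot write, as in the report.
      pq.write_table(table, 'table.parquet')

      # Chunkwise write: stream slices of the table through one writer,
      # one row group per chunk.
      writer = pq.ParquetWriter('table_chunked.parquet', table.schema)
      for batch in table.to_batches(max_chunksize=100):
          writer.write_table(pa.Table.from_batches([batch], schema=table.schema))
      writer.close()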

      Update: It looks like any DataFrame above ~5 GB in size (on disk) produces the same error.
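
      As a quick way to see what "Corrupt footer" refers to: per the Parquet format, a valid file must end with a 4-byte little-endian footer-metadata length followed by the magic bytes PAR1. A small sketch that inspects the tail of a written file (the path is a placeholder):

      import struct

      def check_parquet_footer(path):
          # A valid Parquet file ends with <4-byte footer length> + b'PAR1'.
          with open(path, 'rb') as f:
              f.seek(-8, 2)  # seek to the last 8 bytes
              footer_len, magic = struct.unpack('<I4s', f.read(8))
          print('magic:', magic, 'footer length:', footer_len)
          return magic == b'PAR1'

      check_parquet_footer('table.parquet')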

            People

              Assignee: Antoine Pitrou
              Reporter: Justin Tan
              Votes: 0
              Watchers: 7
