Apache Arrow / ARROW-2369

Large (>~20 GB) files written to Parquet via PyArrow are corrupted

    Details

      Description

      When writing a large Parquet file (above 10 GB or so) from a pandas DataFrame via PyArrow with

      table = pa.Table.from_pandas(my_df)
      pq.write_table(table, 'table.parquet')

      the write succeeds, but when the file is read back, the error

      ArrowIOError: Invalid parquet file. Corrupt footer.

      appears. The same error occurs when the Parquet file is written in chunks (a sketch of such a chunked write follows below). When the files are small, say under 5 GB or so (drawn randomly from the same dataset), everything works as expected. I've also tried pandas' df.to_parquet(), with the same result.
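
      A minimal sketch of such a chunked write, assuming a pandas DataFrame named my_df (the writer setup and the chunk size here are illustrative placeholders, not the exact code used):

      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.Table.from_pandas(my_df)

      # Stream the table into one file in row-group-sized chunks.
      writer = pq.ParquetWriter('table.parquet', table.schema)
      try:
          chunk_rows = 1_000_000  # illustrative chunk size
          for offset in range(0, table.num_rows, chunk_rows):
              writer.write_table(table.slice(offset, chunk_rows))
      finally:
          writer.close()

      # Reading the file back is what raises ArrowIOError for large files:
      pq.read_table('table.parquet')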

      Update: it looks like any DataFrame above ~5 GB on disk produces the same error.
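
      For anyone debugging this: per the Parquet format, a file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1, so a quick check along these lines (an illustrative diagnostic added here, not part of the original report; it assumes the file written above) shows whether the trailer made it to disk:

      import struct

      with open('table.parquet', 'rb') as f:
          f.seek(-8, 2)  # 8 bytes before EOF: <4-byte footer length><4-byte magic>
          tail = f.read(8)

      footer_len = struct.unpack('<I', tail[:4])[0]  # little-endian uint32
      print('magic bytes:', tail[4:])    # a valid file ends with b'PAR1'
      print('footer length:', footer_len)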


              People

              • Assignee: Antoine Pitrou (pitrou)
              • Reporter: Justin Tan (jtan)
              • Votes: 0
              • Watchers: 6
