Apache Arrow / ARROW-8980

[Python] Metadata grows exponentially when using schema from disk


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.16.0
    • Fix Version: 1.0.0
    • Component: Python
    • Environment:
      python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
      [Clang 9.0.0 (tags/RELEASE_900/final)]
      pa version: 0.16.0
      pd version: 0.25.2

    Description

  When overwriting Parquet files, we first read the schema that is already on disk. This is mainly to deal with some type harmonizing between pyarrow and pandas (which I won't go into).

  Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file (first fetching the schema from disk, creating a writer with that schema, then writing the same dataframe), the file size keeps growing even though the number of rows has not changed.

  Note: My workaround was to remove the `b'ARROW:schema'` entry from `schema.metadata`; this stops the file size from growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that `ARROW:schema` is just the schema metadata serialised.

  I should also note that once the metadata gets too big, this leads to a buffer overflow in another part of the code (thrift), which was referenced here: https://issues.apache.org/jira/browse/PARQUET-1345

      import pathlib
      import sys

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      def main():
          print(f"python: {sys.version}")
          print(f"pa version: {pa.__version__}")
          print(f"pd version: {pd.__version__}")

          fname = "test.pq"
          path = pathlib.Path(fname)

          df = pd.DataFrame({"A": [0] * 100000})
          df.to_parquet(fname)
          print(f"Wrote test frame to {fname}")
          print(f"Size of {fname}: {path.stat().st_size}")

          for _ in range(5):
              file = pq.ParquetFile(fname)
              tmp_df = file.read().to_pandas()
              print(f"Number of rows on disk: {tmp_df.shape}")
              print("Reading schema from disk")
              schema = file.schema.to_arrow_schema()
              print("Creating new writer")
              writer = pq.ParquetWriter(fname, schema=schema)
              print("Re-writing the dataframe")
              writer.write_table(pa.Table.from_pandas(df))
              writer.close()
              print(f"Size of {fname}: {path.stat().st_size}")

      if __name__ == "__main__":
          main()
      
      (sdm) ➜ ~ python growing_metadata.py
      python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
      [Clang 9.0.0 (tags/RELEASE_900/final)]
      pa version: 0.16.0
      pd version: 0.25.2
      Wrote test frame to test.pq
      Size of test.pq: 1643
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 3637
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 8327
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 19301
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 44944
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 104815

      Attachments

        1. growing_metadata.py
          1 kB
          Kevin Glasson
        2. test.pq
          102 kB
          Kevin Glasson


            People

              Assignee: Wes McKinney (wesm)
              Reporter: Kevin Glasson (kevinglasson)
              Votes: 0
              Watchers: 5


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 40m