Apache Arrow / ARROW-8980

[Python] Metadata grows exponentially when using schema from disk


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.16.0
    • Fix Version: 1.0.0
    • Component: Python
    • Environment:
      python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
      [Clang 9.0.0 (tags/RELEASE_900/final)]
      pa version: 0.16.0
      pd version: 0.25.2

    Description

  When overwriting Parquet files, we first read the schema that is already on disk. This is mainly to deal with some type harmonizing between pyarrow and pandas (which I won't go into).

  Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file (first fetching the schema from disk, creating a writer with that schema, then writing the same dataframe), the file size keeps growing even though the number of rows has not changed.

  Note: My workaround was to remove the `b'ARROW:schema'` entry from `schema.metadata`; this stops the file size from growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that `ARROW:schema` is just the schema metadata serialised.

  I should also note that once the metadata gets too big, this leads to a buffer overflow in another part of the code (thrift), which was referenced here: https://issues.apache.org/jira/browse/PARQUET-1345

      import pathlib
      import sys

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      def main():
          print(f"python: {sys.version}")
          print(f"pa version: {pa.__version__}")
          print(f"pd version: {pd.__version__}")

          fname = "test.pq"
          path = pathlib.Path(fname)

          df = pd.DataFrame({"A": [0] * 100000})
          df.to_parquet(fname)
          print(f"Wrote test frame to {fname}")
          print(f"Size of {fname}: {path.stat().st_size}")

          for _ in range(5):
              file = pq.ParquetFile(fname)
              tmp_df = file.read().to_pandas()
              print(f"Number of rows on disk: {tmp_df.shape}")
              print("Reading schema from disk")
              schema = file.schema.to_arrow_schema()
              print("Creating new writer")
              writer = pq.ParquetWriter(fname, schema=schema)
              print("Re-writing the dataframe")
              writer.write_table(pa.Table.from_pandas(df))
              writer.close()
              print(f"Size of {fname}: {path.stat().st_size}")

      if __name__ == "__main__":
          main()
      
      (sdm) ➜ ~ python growing_metadata.py
      python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
      [Clang 9.0.0 (tags/RELEASE_900/final)]
      pa version: 0.16.0
      pd version: 0.25.2
      Wrote test frame to test.pq
      Size of test.pq: 1643
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 3637
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 8327
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 19301
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 44944
      Number of rows on disk: (100000, 1)
      Reading schema from disk
      Creating new writer
      Re-writing the dataframe
      Size of test.pq: 104815

      Attachments

        1. growing_metadata.py
          1 kB
          Kevin Glasson
        2. test.pq
          102 kB
          Kevin Glasson


            People

              Assignee: Wes McKinney (wesm)
              Reporter: Kevin Glasson (kevinglasson)
              Votes: 0
              Watchers: 5


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 40m