Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.16.0
Environment:
python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Description
When overwriting parquet files, we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (which I won't go into).
Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed.
Note: My solution was to remove the `b'ARROW:schema'` entry from the `schema.metadata`; this seems to stop the file size growing (sketched below). So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that ARROW:schema is just the metadata serialised or something.
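For reference, a minimal sketch of that workaround, assuming the schema is re-read from the existing file exactly as in the reproduction script further down (same `test.pq` file and dataframe):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

fname = "test.pq"
df = pd.DataFrame({"A": [0] * 100000})

file = pq.ParquetFile(fname)
schema = file.schema.to_arrow_schema()

# Drop the serialised ARROW:schema entry before re-using the schema,
# so it is not embedded (and grown) again on the next write.
metadata = {k: v for k, v in (schema.metadata or {}).items()
            if k != b"ARROW:schema"}
schema = schema.remove_metadata().with_metadata(metadata)

writer = pq.ParquetWriter(fname, schema=schema)
writer.write_table(pa.Table.from_pandas(df))
writer.close()
```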
I should also note that once the metadata gets too big, this leads to a buffer overflow in another part of the code (thrift), which was referenced here: https://issues.apache.org/jira/browse/PARQUET-1345
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pathlib
import sys


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 100000})
    df.to_parquet(fname)
    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")

        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()

        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)

        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
```
```
(sdm) ➜ ~ python growing_metadata.py
python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Wrote test frame to test.pq
Size of test.pq: 1643
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 3637
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 8327
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 19301
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 44944
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 104815
```
Attachments
Issue Links
- relates to: ARROW-9009 [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files (Resolved)