Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 7.0.0
None
None
Environment: macOS, Python 3.8.10.
pyarrow: '7.0.0'
pandas: '1.4.2'
numpy: '1.22.3'
Description
I'm trying to follow the example at https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files to write a partitioned dataset, but I consistently get an error about non-equal schemas. Here's an MCVE:
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)

metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)
This raises the following error:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
But all schemas in the `metadata_collector` list seem to be the same:
all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True