Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.15.1
Description
Parquet file metadata for Decimal-type columns contains min and max statistics that are returned as raw bytes instead of being decoded into Decimal values. This causes issues in dependent libraries such as Dask (see https://github.com/dask/dask/issues/5647).
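For illustration, here is a minimal sketch of the decoding step that the statistics currently skip. It assumes the min/max bytes hold the unscaled decimal value as a big-endian two's-complement integer, which is the layout the Parquet format defines for byte-array-backed decimals. The helper name `decode_decimal_stat` and its `scale` argument are hypothetical and not part of pyarrow; the scale would have to be taken from the column's decimal type.

```python
from decimal import Decimal


def decode_decimal_stat(raw: bytes, scale: int) -> Decimal:
    """Decode a raw statistics value into a Decimal.

    Assumption: `raw` is the unscaled value stored as a big-endian
    two's-complement integer, and `scale` is the number of digits
    after the decimal point for the column.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Build the Decimal from its parts so no context rounding is applied.
    return Decimal((sign, digits, -scale))


# 0x3039 == 12345, so with scale 2 this decodes to 123.45.
example = decode_decimal_stat(b"\x00\x00\x30\x39", scale=2)
```

On an affected version, something like this could be applied to `statistics.min` and `statistics.max` as a stopgap, provided the assumption about the byte layout holds for the file in question.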
Reproducible example
from decimal import Decimal
import random

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

NUM_DATA_POINTS_PER_PARTITION = 25

random.seed(0)
data1 = [
    {"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")}
    for i in range(NUM_DATA_POINTS_PER_PARTITION)
]
df = pd.DataFrame(data1)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'my_data.parquet')

parquet_file = pq.ParquetFile('my_data.parquet')
assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.min, Decimal)  # <-- AssertionError here because min has type bytes rather than Decimal
assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.max, Decimal)