Details
Bug
Status: Resolved
Major
Resolution: Fixed
cpp-4.0.0
Description
Observed behaviour
Statistics for categorical data are identical across all row groups and describe the entire CategoricalDtype rather than the data contained in each row group.
Expected behaviour
The row group statistics should cover only the data that is actually part of the row group, not the entire CategoricalDtype.
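To make the difference concrete, here is a minimal pure-Python sketch of the two ways min/max could be derived for a dictionary-encoded column. It is only a model of the idea, not Arrow's actual C++ implementation; the function names `dictionary_wide_min_max` and `row_group_min_max` are hypothetical and chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def dictionary_wide_min_max(dictionary: Sequence[str]) -> Tuple[str, str]:
    """Min/max over every dictionary entry, regardless of which entries
    a row group actually uses (the behaviour observed in this report)."""
    return min(dictionary), max(dictionary)


def row_group_min_max(
    dictionary: Sequence[str], indices: Sequence[Optional[int]]
) -> Optional[Tuple[str, str]]:
    """Min/max over only the values referenced by this row group's indices.
    A None index models a null; an all-null row group has no min/max."""
    values: List[str] = [dictionary[i] for i in indices if i is not None]
    if not values:
        return None
    return min(values), max(values)


# Model of the example in this report: categories ["1", "42"],
# written as two row groups holding one value each.
dictionary = ["1", "42"]
row_groups = [[0], [1]]

for number, indices in enumerate(row_groups):
    observed = dictionary_wide_min_max(dictionary)
    expected = row_group_min_max(dictionary, indices)
    print(f"row group {number}: observed={observed} expected={expected}")
```

In this model the dictionary-wide variant reports the same pair for every row group, while the per-row-group variant only reports values that are actually present, which is what a reader doing min/max-based row group pruning would need.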
Minimal example
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
table = pa.Table.from_pandas(test_df)
pq.write_table(
    table,
    "test_parquet",
    chunk_size=1,
)
test_parquet = pq.ParquetFile("test_parquet")
test_parquet.metadata.row_group(0).column(0).statistics
Out[1]:
<pyarrow._parquet.Statistics object at 0x1163b5280>
has_min_max: True
min: 1
max: 42
null_count: 0
distinct_count: 0
num_values: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
Expected would be
min: 1, max: 1 (instead of max: 42) for the first row group
Tested with
pandas==1.0.0
pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / essentially 0.16.0)
Issue Links
- relates to ARROW-11634 [C++][Parquet] Parquet statistics (min/max) for dictionary columns are incorrect (Resolved)