[PARQUET-1783] [C++] Parquet statistics wrong for dictionary type - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: cpp-4.0.0
Fix Version/s: None
Component/s: parquet-cpp
Labels: None

Description

Observed behaviour

Statistics for categorical data are identical across all row groups: they describe the entire CategoricalDtype rather than the data actually included in each row group.

Expected behaviour

The statistics of each row group should cover only the data that is part of that row group, not the entire CategoricalDtype.

Minimal example

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
table = pa.Table.from_pandas(test_df)
pq.write_table(
    table,
    "test_parquet",
    chunk_size=1,
)
test_parquet = pq.ParquetFile("test_parquet")
test_parquet.metadata.row_group(0).column(0).statistics

Out[1]:
<pyarrow._parquet.Statistics object at 0x1163b5280>
  has_min_max: True
  min: 1
  max: 42
  null_count: 0
  distinct_count: 0
  num_values: 1
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

Expected output

For the first row group: min: 1 and max: 1, instead of max: 42.
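
The difference between the observed and the expected statistics can be sketched in plain Python, without pyarrow. This is only an illustration under stated assumptions, not the parquet-cpp implementation: the function names `row_group_stats` and `dictionary_wide_stats` are hypothetical, and the column is modelled as a dictionary plus a list of indices, split into row groups of a fixed size.

```python
from typing import List, Sequence, Tuple


def row_group_stats(
    dictionary: Sequence[str], indices: Sequence[int], row_group_size: int
) -> List[Tuple[str, str]]:
    """Expected behaviour: (min, max) per row group, computed only from
    the dictionary values that the row group's indices actually reference."""
    stats = []
    for start in range(0, len(indices), row_group_size):
        chunk = indices[start:start + row_group_size]
        values = [dictionary[i] for i in chunk]
        stats.append((min(values), max(values)))
    return stats


def dictionary_wide_stats(
    dictionary: Sequence[str], indices: Sequence[int], row_group_size: int
) -> List[Tuple[str, str]]:
    """Reported behaviour: every row group carries the (min, max) of the
    whole dictionary, regardless of which values it contains."""
    n_groups = -(-len(indices) // row_group_size)  # ceiling division
    return [(min(dictionary), max(dictionary))] * n_groups


# Hypothetical model of the column from the minimal example:
# categories "1" and "42", one value per row group.
# Note: Python compares strings by code point, which matches byte-wise
# ordering for these ASCII values.
dictionary = ["1", "42"]
indices = [0, 1]

expected = row_group_stats(dictionary, indices, row_group_size=1)
observed = dictionary_wide_stats(dictionary, indices, row_group_size=1)

print(expected)  # [('1', '1'), ('42', '42')]
print(observed)  # [('1', '42'), ('1', '42')]
```

In this model the first row group only references the value "1", so its statistics should be min "1" and max "1"; the reported output instead matches the dictionary-wide variant.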

Tested with
pandas==1.0.0
pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / essentially 0.16.0)

Issue Links

relates to: ARROW-11634 [C++][Parquet] Parquet statistics (min/max) for dictionary columns are incorrect (Resolved)

People

Assignee: Unassigned
Reporter: Florian Jetter
Votes: 0
Watchers: 8

Dates

Created: 31/Jan/20 10:22
Updated: 23/Jun/24 03:31
Resolved: 12/May/24 14:24