Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.0.0
Fix Version/s: None
Description
The distinct_count attribute of the column chunk metadata statistics is broken: it always shows 0. This appears to affect all column types. I checked int64 columns as well as dictionary-encoded string columns:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pydict({
    'foo': pa.array(['ABC', 'DEF']).dictionary_encode()
})
pq.write_table(table, 'test_row_group_statistics.parquet',
               version='2.0', data_page_version='2.0')

pq_file = pq.ParquetFile('test_row_group_statistics.parquet')
print(pq_file.metadata.row_group(0).column(0).statistics)
Output:
<pyarrow._parquet.Statistics object at 0x0000020A1699D770>
  has_min_max: True
  min: ABC
  max: DEF
  null_count: 0
  distinct_count: 0
  num_values: 2
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8
From a quick grep, it seems the writer simply never sets it in the first place. A Parquet file written by something other than Arrow C++ might have it set.
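To check whether this affects every column of a given file, or whether a file from another writer carries the value, something like the sketch below could be used. It is only a rough helper, not part of pyarrow: the names find_unset_distinct_counts and check_file are hypothetical, and treating a distinct_count of 0 (or None) on a chunk that holds values as "not set" is an assumption based on the behaviour described above.

```python
def find_unset_distinct_counts(metadata):
    """Return (row_group_index, column_path) pairs whose distinct_count looks unset.

    `metadata` is expected to behave like pyarrow.parquet.FileMetaData.
    Assumption: a distinct_count of 0 (or None) on a chunk with at least one
    value means the writer did not record it.
    """
    unset = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            column = row_group.column(col_index)
            stats = column.statistics  # None when no statistics were written
            if stats is None:
                continue
            # A chunk with values cannot legitimately have zero distinct values.
            if stats.num_values > 0 and not stats.distinct_count:
                unset.append((rg_index, column.path_in_schema))
    return unset


def check_file(path):
    """Hypothetical convenience wrapper: run the check on a Parquet file."""
    # Imported lazily so the helper above has no hard dependency on pyarrow.
    import pyarrow.parquet as pq
    return find_unset_distinct_counts(pq.ParquetFile(path).metadata)
```

Calling check_file('test_row_group_statistics.parquet') on the file from the reproduction would list every column chunk where the count appears to be missing. Running it on a file produced by a different writer would show whether that writer populates the field.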