Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- 3.0.0
Description
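In the session below, the min/max statistics for both row groups are reported as ('A', 'B'). I would expect to see ('A', 'A') for the first row group and ('B', 'B') for the second row group, since each row group contains only one distinct value.
I suspect this is a C++ issue, but when I went looking for the code that calculates the statistics, I was unable to find it.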
>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>> t = pa.table({"col":d})
>>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
[
  -- dictionary:
    [ "A", "B" ]
  -- indices:
    [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
]
>>> f.read_row_groups([1]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
[
  -- dictionary:
    [ "A", "B" ]
  -- indices:
    [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
]
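To make the expectation concrete, here is a minimal plain-Python sketch of the per-row-group min/max I would expect, computed only from the dictionary values that the indices in each row group actually reference. It is not pyarrow's or the Parquet C++ library's code, and the helper names `referenced_min_max` and `whole_dictionary_min_max` are hypothetical. The reported ('A', 'B') for both row groups matches what taking min/max over the whole dictionary would give, but that is only a guess at the cause; nothing in this report confirms it.

```python
# Plain-Python illustration only; the helper names are hypothetical and
# this is not how pyarrow or Parquet C++ is implemented.

def referenced_min_max(dictionary, indices):
    """Min/max over the dictionary values that `indices` actually point at."""
    referenced = [dictionary[i] for i in indices if i is not None]
    if not referenced:
        return None  # no non-null values, so no statistics
    return (min(referenced), max(referenced))


def whole_dictionary_min_max(dictionary):
    """Min/max over every dictionary entry, whether referenced or not."""
    return (min(dictionary), max(dictionary))


# Same data as in the report: 100 rows of "A" followed by 100 rows of "B".
dictionary = ["A", "B"]
indices = (100 * [0]) + (100 * [1])
row_group_size = 100

# Split the indices into row groups of `row_group_size` rows each.
row_groups = [
    indices[start:start + row_group_size]
    for start in range(0, len(indices), row_group_size)
]

expected = [referenced_min_max(dictionary, rg) for rg in row_groups]
suspected = [whole_dictionary_min_max(dictionary) for _ in row_groups]

print("expected :", expected)
print("suspected:", suspected)
```

With the data above, `expected` differs between the two row groups, while `suspected` is identical for both, which is the pattern seen in the reported statistics.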
Issue Links
- is related to PARQUET-1783 [C++] Parquet statistics wrong for dictionary type (Resolved)