[ARROW-11634] [C++][Parquet] Parquet statistics (min/max) for dictionary columns are incorrect - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 6.0.0
Component/s: C++, Parquet
Labels:
- parquet
- parquet-statistics

External issue URL:
https://github.com/apache/arrow/issues/27497

Description

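When a dictionary array is written to Parquet, the min/max statistics of each row group appear to cover the whole dictionary rather than only the values present in that row group. In the example below, I would expect to see ('A', 'A') for the first row group and ('B', 'B') for the second row group, but both report ('A', 'B').

I suspect this is a C++ issue, but I could not find the code where the statistics are calculated.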

>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>> t = pa.table({"col":d})
>>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
[ 
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      ...
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ]
]
>>> f.read_row_groups([1]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
[
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      ...
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1
    ]
]
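
The difference between the expected and the reported statistics can be sketched in plain Python, without pyarrow. This is only an illustration: row_group_stats is a hypothetical helper, not an Arrow or Parquet API, and the idea that the wrong values come from taking min/max over the entire dictionary is an assumption that this report does not confirm.

```python
def row_group_stats(dictionary, indices, row_group_size, referenced_only=True):
    """Return one (min, max) pair per row group of a dictionary-encoded column.

    Hypothetical helper for illustration only; not part of pyarrow.
    referenced_only=True  -> use only the values the row group's indices point at.
    referenced_only=False -> use every dictionary entry (assumed buggy behaviour).
    """
    stats = []
    for start in range(0, len(indices), row_group_size):
        chunk = indices[start:start + row_group_size]
        if referenced_only:
            # Look up only the dictionary values this row group actually uses.
            values = [dictionary[i] for i in chunk if i is not None]
        else:
            # Ignore the indices and consider the whole dictionary.
            values = list(dictionary)
        stats.append((min(values), max(values)) if values else (None, None))
    return stats


# Same data as the reproduction above: 100 x "A" followed by 100 x "B".
dictionary = ["A", "B"]
indices = (100 * [0]) + (100 * [1])

expected = row_group_stats(dictionary, indices, 100)
reported = row_group_stats(dictionary, indices, 100, referenced_only=False)

print(expected)  # [('A', 'A'), ('B', 'B')]
print(reported)  # [('A', 'B'), ('A', 'B')]
```

The second result matches the output of the reproduction, which is consistent with (but does not prove) the whole-dictionary explanation.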

Issue Links

is related to: PARQUET-1783 [C++] Parquet statistics wrong for dictionary type (Resolved)

People

Assignee: Weston Pace

Reporter: Daniel Nugent

Votes: 0

Watchers: 5

Dates

Created: 15/Feb/21 18:35

Updated: 11/Jan/23 08:21

Resolved: 15/Sep/21 00:13