Apache Arrow / ARROW-11634

[C++][Parquet] Parquet statistics (min/max) for dictionary columns are incorrect


Details

    Description

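      When a dictionary column is written to Parquet across several row groups, every row group reports the same min/max statistics. In the example below both row groups report ('A', 'B'), even though the first row group contains only "A" and the second only "B". I would expect to see ('A', 'A') for the first row group and ('B', 'B') for the second row group.

      I suspect this is a C++ issue, but I went looking for the code that calculates the statistics and was unable to find it.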

      >>> import pyarrow as pa
      >>> import pyarrow.parquet as papq
      >>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
      >>> t = pa.table({"col":d})
      >>> papq.write_table(t,'sample.parquet',row_group_size=100)
      >>> f = papq.ParquetFile('sample.parquet')
      >>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
      ('A', 'B')
      >>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
      ('A', 'B')
      >>> f.read_row_groups([0]).column(0)
      <pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
      [ 
        -- dictionary:
          [
            "A",
            "B"
          ]
        -- indices:
          [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            ...
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0
          ]
      ]
      >>> f.read_row_groups([1]).column(0)
      <pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
      [
        -- dictionary:
          [
            "A",
            "B"
          ]
        -- indices:
          [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            ...
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
          ]
      ]
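
The difference between the reported and the expected statistics can be illustrated with a small pure-Python sketch. It is not the Arrow C++ code: the two helper functions are hypothetical, and the idea that the writer takes min/max over the whole dictionary is an assumption that happens to match the output above, not something confirmed in this report.

```python
# Pure-Python illustration; no pyarrow required.
# Both helpers are hypothetical and exist only for this sketch.

def stats_from_dictionary(dictionary):
    """Min/max over every dictionary entry, whether or not it is referenced."""
    return min(dictionary), max(dictionary)


def stats_from_referenced_values(dictionary, indices):
    """Min/max over only the entries that the given indices point at."""
    used = {dictionary[i] for i in indices if i is not None}
    return min(used), max(used)


# Same data as the reproduction: 100 x "A" followed by 100 x "B".
dictionary = ["A", "B"]
indices = (100 * [0]) + (100 * [1])
row_group_size = 100

# Split the indices the way row_group_size=100 splits the table.
row_groups = [
    indices[start:start + row_group_size]
    for start in range(0, len(indices), row_group_size)
]

# Assumed faulty behaviour: the dictionary alone decides the statistics.
observed = [stats_from_dictionary(dictionary) for _ in row_groups]

# Expected behaviour: only the values present in each row group count.
expected = [stats_from_referenced_values(dictionary, rg) for rg in row_groups]

print(observed)
print(expected)
```

Under that assumption, `observed` reproduces the values reported above for both row groups, while `expected` holds the per-row-group values I would expect the writer to store.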
      

            People

              westonpace Weston Pace
              nugend Daniel Nugent