PARQUET-1783

[C++] Parquet statistics wrong for dictionary type

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • cpp-4.0.0
    • None
    • parquet-cpp
    • None

    Description

      Observed behaviour

      Statistics for categorical data are identical across all row groups: they describe the entire CategoricalDtype (all categories) rather than only the data contained in each row group.

      Expected behaviour

      The row group statistics should cover only the data that is actually part of the row group, not the entire CategoricalDtype.
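
      To make the distinction concrete, here is a minimal pure-Python sketch of the two ways min/max could be derived for a dictionary-encoded column. It is only an illustration of the behaviour described above, not parquet-cpp's actual code; the function names and the dictionary/indices layout are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def stats_over_dictionary(dictionary: Sequence[str]) -> Tuple[str, str]:
    """Min/max over every dictionary entry (the behaviour reported as wrong)."""
    return min(dictionary), max(dictionary)


def stats_over_row_group(
    dictionary: Sequence[str], indices: Sequence[int]
) -> Tuple[str, str]:
    """Min/max over only the values that the row group's indices refer to."""
    values = [dictionary[i] for i in indices]
    return min(values), max(values)


# Hypothetical stand-in for the column in this report: two categories,
# written with one value per row group.
dictionary = ["1", "42"]
row_groups: List[List[int]] = [[0], [1]]

observed = [stats_over_dictionary(dictionary) for _ in row_groups]
expected = [stats_over_row_group(dictionary, idx) for idx in row_groups]

print(observed)  # [('1', '42'), ('1', '42')]
print(expected)  # [('1', '1'), ('42', '42')]
```

      The first list mirrors the observed statistics (the same min and max for every row group), while the second mirrors the expected ones.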

      Minimal example

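      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Two distinct categories, so the dictionary holds both "1" and "42".
      test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
      table = pa.Table.from_pandas(test_df)
      # chunk_size=1 puts each row into its own row group.
      pq.write_table(
          table,
          "test_parquet",
          chunk_size=1,
      )
      test_parquet = pq.ParquetFile("test_parquet")
      # Statistics of the first row group, which contains only the value "1".
      test_parquet.metadata.row_group(0).column(0).statistics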
      
      Out[1]:
      <pyarrow._parquet.Statistics object at 0x1163b5280>
        has_min_max: True
        min: 1
        max: 42
        null_count: 0
        distinct_count: 0
        num_values: 1
        physical_type: BYTE_ARRAY
        logical_type: String
        converted_type (legacy): UTF8
      

      Expected result

      For the first row group: min: 1 and max: 1, instead of max: 42.

      Tested with
      pandas==1.0.0
      pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / essentially 0.16.0)

            People

              Assignee: Unassigned
              Reporter: Florian Jetter (fjetter)