Apache Arrow / ARROW-12513

[C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls


Details

    Description

      When a Table that contains dictionary-encoded columns is written to Parquet, those columns report an incorrect null_count of 0 in the Parquet column statistics, even when the arrays contain nulls.  If the same data is written without dictionary-encoding the array, the null_count is correct.

      Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.

      NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ implementation of the Arrow/Parquet writer.

      Setup

      import pyarrow as pa
      from pyarrow import parquet

      Bug

      (writes a dictionary-encoded Arrow array to Parquet)

      array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
      assert array1.null_count == 5
      array1dict = array1.dictionary_encode()
      assert array1dict.null_count == 5
      table = pa.Table.from_arrays([array1dict], ["mycol"])
      parquet.write_table(table, "testtable.parquet")
      meta = parquet.read_metadata("testtable.parquet")
      meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!)

      Correct

      (writes the same data without dictionary-encoding the Arrow array)

      array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
      assert array1.null_count == 5
      table = pa.Table.from_arrays([array1], ["mycol"])
      parquet.write_table(table, "testtable.parquet")
      meta = parquet.read_metadata("testtable.parquet")
      meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)

            People

              westonpace Weston Pace
              dbeach24 David Beach

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 6.5h