Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 1.0.1, 2.0.0, 3.0.0
- Environment: RHEL6
Description
When writing a Table to Parquet, columns represented as dictionary-encoded arrays show an incorrect null_count of 0 in the Parquet column statistics. If the same data is written without dictionary-encoding the array, the null_count is correct.
Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ implementation of the Arrow/Parquet writer.
Setup
import pyarrow as pa
from pyarrow import parquet
Bug
(writes a dictionary-encoded Arrow array to Parquet)
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
assert array1.null_count == 5
array1dict = array1.dictionary_encode()
assert array1dict.null_count == 5
table = pa.Table.from_arrays([array1dict], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!)
Correct
(writes the same data without dictionary-encoding the Arrow array)
array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
assert array1.null_count == 5
table = pa.Table.from_arrays([array1], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
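Workaround
A minimal sketch of a possible workaround until the writer is fixed, assuming the dictionary-encoded column can be cast back to its value type before writing (the cast step is my addition, not part of the original report):

import pyarrow as pa
from pyarrow import parquet

array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
array1dict = array1.dictionary_encode()
# ASSUMPTION: cast the dictionary-encoded array back to plain strings so the
# writer takes the non-dictionary path, which records null_count correctly.
plain = array1dict.cast(pa.string())
table = pa.Table.from_arrays([plain], ["mycol"])
parquet.write_table(table, "testtable.parquet")
meta = parquet.read_metadata("testtable.parquet")
assert meta.row_group(0).column(0).statistics.null_count == 5

This trades away the file-size benefit of dictionary encoding, so it only makes sense where correct statistics matter more than compactness.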