Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Description
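parquet-mr incorrectly writes the pair (dictionary_page_offset, first_data_page_offset) as (0, dictionary_page_offset): the dictionary page offset is recorded as 0, and the data page offset field holds the offset of the dictionary page instead.
As a result, whenever parquet-cpp (via pyarrow) reads such a file, it reports `has_dictionary_page: False` and `dictionary_page_offset: None`.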
row group 0
--------------------------------------------------------------------------------
x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y:  BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]

    x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
  file_offset: 4
  file_path:
  physical_type: DOUBLE
  num_values: 70000
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
      has_min_max: True
      min: 1.0
      max: 5.0
      null_count: 10000
      distinct_count: 0
      num_values: 60000
      physical_type: DOUBLE
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 1632
  total_uncompressed_size: 31635
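For anyone trying to reproduce this, here is a minimal sketch of how the offset-related fields can be pulled out of each column chunk with pyarrow. The helper names `summarize_chunk` and `summarize_file` are made up for illustration and are not part of pyarrow. The stand-in object at the end only mirrors the values reported above, so the sketch runs without the affected file.

```python
from types import SimpleNamespace


def summarize_chunk(col):
    """Collect the offset-related fields of one column chunk into a dict."""
    return {
        "path_in_schema": col.path_in_schema,
        "encodings": tuple(col.encodings),
        "has_dictionary_page": col.has_dictionary_page,
        "dictionary_page_offset": col.dictionary_page_offset,
        "data_page_offset": col.data_page_offset,
    }


def summarize_file(path):
    """Summarize every column chunk of a Parquet file (requires pyarrow)."""
    import pyarrow.parquet as pq  # imported lazily so the helper above stays standalone

    metadata = pq.ParquetFile(path).metadata
    summaries = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            summaries.append(summarize_chunk(row_group.column(col_index)))
    return summaries


# Stand-in for the ColumnChunkMetaData of column x, using the values from this report.
reported_x = SimpleNamespace(
    path_in_schema="x",
    encodings=("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"),
    has_dictionary_page=False,
    dictionary_page_offset=None,
    data_page_offset=4,
)
summary = summarize_chunk(reported_x)
print(summary)
```

Calling `summarize_file` on a file written by the affected parquet-mr version should show the same combination: a dictionary encoding in `encodings` but no dictionary page offset.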
Is parquet-cpp still able to use the dictionary in this case?
It would be nice if parquet-cpp could recognize the parquet-mr issue and set `has_dictionary_page` to True.
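One possible shape for such a check, sketched in Python on the caller side rather than inside parquet-cpp: treat a chunk as having a dictionary page when an offset is recorded, or when no offset is recorded but the encodings list includes a dictionary encoding. This is only a heuristic under the assumption that a dictionary encoding in the list implies a dictionary page is present. The function name `likely_has_dictionary_page` is hypothetical, and this is not how parquet-cpp currently decides.

```python
# Encodings that imply dictionary-encoded pages (assumption of this heuristic).
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def likely_has_dictionary_page(encodings, dictionary_page_offset):
    """Guess whether a column chunk contains a dictionary page.

    encodings: iterable of encoding names, as reported in the chunk metadata.
    dictionary_page_offset: the recorded offset, or None/0 when not recorded.
    """
    if dictionary_page_offset:
        # A positive offset was written, so trust the metadata.
        return True
    # No usable offset: fall back to the encodings list.
    return any(enc in DICTIONARY_ENCODINGS for enc in encodings)


# Values reported for column x in this issue.
print(likely_has_dictionary_page(("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"), None))
```

With pyarrow, the two arguments would come from the `encodings` and `dictionary_page_offset` attributes of a column chunk's metadata.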