[PARQUET-1547] [C++] Detect parquet-mr style dictionary_page - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: parquet-cpp
Labels: None

Description

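parquet-mr incorrectly writes the pair (dictionary_page_offset, first_data_page_offset) as (0, dictionary_page_offset): the dictionary offset field is left at 0, and the first data page offset field holds the dictionary page's actual offset.

As a result, whenever parquet-cpp (pyarrow) reads such a file, it reports `has_dictionary_page: False` and `dictionary_page_offset: None`, even though the column is dictionary-encoded.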

row group 0 
--------------------------------------------------------------------------------
x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y:  BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]

    x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000

<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
  file_offset: 4
  file_path: 
  physical_type: DOUBLE
  num_values: 70000
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
      has_min_max: True
      min: 1.0
      max: 5.0
      null_count: 10000
      distinct_count: 0
      num_values: 60000
      physical_type: DOUBLE
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 1632
  total_uncompressed_size: 31635

Is parquet-cpp still able to use the dictionary when reading such a file?
It would be nice if parquet-cpp could recognize this parquet-mr behaviour and set `has_dictionary_page` to True.
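
To make the request concrete, here is a rough sketch of the kind of check I have in mind. It is only a heuristic and not how parquet-cpp works today: `ChunkMeta`, `looks_like_parquet_mr_dictionary` and `effective_dictionary_offset` are hypothetical names. `ChunkMeta` is a stand-in that mirrors three of the fields pyarrow prints above. The sketch assumes that a dictionary encoding in `encodings` together with a missing or zero dictionary offset means the dictionary page sits at `data_page_offset`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Encoding names that imply the column chunk carries a dictionary page.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


@dataclass
class ChunkMeta:
    """Hypothetical stand-in for the column chunk fields shown above."""
    encodings: Tuple[str, ...]
    dictionary_page_offset: Optional[int]
    data_page_offset: int


def looks_like_parquet_mr_dictionary(meta: ChunkMeta) -> bool:
    """Heuristic: dictionary encoding is listed, but no dictionary offset is set."""
    no_dict_offset = meta.dictionary_page_offset in (None, 0)
    uses_dictionary = any(e in DICTIONARY_ENCODINGS for e in meta.encodings)
    return no_dict_offset and uses_dictionary


def effective_dictionary_offset(meta: ChunkMeta) -> Optional[int]:
    """Return the offset where the dictionary page is assumed to start, if any."""
    if meta.dictionary_page_offset not in (None, 0):
        # The writer recorded a real dictionary offset; trust it.
        return meta.dictionary_page_offset
    if looks_like_parquet_mr_dictionary(meta):
        # Assumption: in this layout the dictionary page is the first page,
        # so it starts at the recorded data page offset.
        return meta.data_page_offset
    return None


# Column x from the metadata above.
x = ChunkMeta(
    encodings=("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"),
    dictionary_page_offset=None,
    data_page_offset=4,
)
print(looks_like_parquet_mr_dictionary(x), effective_dictionary_offset(x))
```

For column x this flags the chunk and gives 4 as the assumed dictionary offset, which matches FPO:4 in the dump. A real fix in parquet-cpp would presumably also need to confirm that the page at that offset is actually a dictionary page.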

https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding/

People

Assignee: Unassigned

Reporter: colin fang

Dates

Created: 18/Mar/19 18:28

Updated: 16/Aug/19 14:06