PARQUET-1547

[C++] Detect parquet-mr style dictionary_page

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Component/s: parquet-cpp

    Description

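      parquet-mr incorrectly writes the pair (dictionary_page_offset, first_data_page_offset) as (0, dictionary_page_offset). In other words, the dictionary page offset is recorded as 0, and the data page offset field holds the offset of the dictionary page instead.

      As a result, whenever parquet-cpp (via pyarrow) reads such a file, it reports `has_dictionary_page: False` and `dictionary_page_offset: None`, as the output below shows for column x: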

      row group 0 
      --------------------------------------------------------------------------------
      x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
      y:  BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
      
          x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
          ----------------------------------------------------------------------------
          page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000
      
      
      <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
        file_offset: 4
        file_path: 
        physical_type: DOUBLE
        num_values: 70000
        path_in_schema: x
        is_stats_set: True
        statistics:
          <pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
            has_min_max: True
            min: 1.0
            max: 5.0
            null_count: 10000
            distinct_count: 0
            num_values: 60000
            physical_type: DOUBLE
        compression: SNAPPY
        encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
        has_dictionary_page: False
        dictionary_page_offset: None
        data_page_offset: 4
        total_compressed_size: 1632
        total_uncompressed_size: 31635
      

      Is parquet-cpp still able to use the dictionary in this case?
      It would be nice if parquet-cpp could recognize this parquet-mr behavior and set `has_dictionary_page` to True.

      https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding/
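
Until parquet-cpp handles this itself, a reader could apply a heuristic on the metadata pyarrow already exposes. The following is only a minimal sketch, not parquet-cpp's behavior: the function name `guess_dictionary_page_offset` is hypothetical, and it assumes that when a dictionary encoding is listed but no dictionary page is reported, the dictionary page sits at `data_page_offset`, as in the parquet-mr output above. It only reads the attributes shown in the `ColumnChunkMetaData` dump, so a stand-in object with the values from column x is used here instead of a real file.

```python
from types import SimpleNamespace

# Encodings that imply the column chunk was written with a dictionary.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def guess_dictionary_page_offset(column):
    """Return a likely dictionary page offset for a column chunk, or None.

    `column` is any object with the attributes printed by pyarrow's
    ColumnChunkMetaData: encodings, has_dictionary_page,
    dictionary_page_offset and data_page_offset.
    """
    if column.has_dictionary_page:
        # The metadata is already consistent; trust it.
        return column.dictionary_page_offset
    if DICTIONARY_ENCODINGS & set(column.encodings):
        # Assumption (parquet-mr style): the dictionary page is the first
        # page of the chunk, so it starts where the metadata claims the
        # first data page starts.
        return column.data_page_offset
    # No dictionary encoding listed, so there is nothing to recover.
    return None


# Stand-in for column x, using the values from the report above.
x = SimpleNamespace(
    encodings=("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"),
    has_dictionary_page=False,
    dictionary_page_offset=None,
    data_page_offset=4,
)

print(guess_dictionary_page_offset(x))  # prints 4
```

With a real file, the same function could be passed `pq.ParquetFile(path).metadata.row_group(0).column(0)`. This only guesses where the dictionary page is located; it does not change how parquet-cpp decodes the column.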

          People

            Assignee: Unassigned
            Reporter: colin fang (colinfang)