Apache Arrow / ARROW-3564

[Python] writing version 2.0 parquet format with dictionary encoding enabled

Details

    Description

      Using pyarrow v0.11.0, the attached script writes a simple table (taken from the pyarrow documentation) to Parquet format versions 1.0 and 2.0, each with and without dictionary encoding enabled.

      Inspecting the written files with parquet-tools suggests that dictionary encoding is not used in either of the version 2.0 files. Both files report that the columns are encoded using PLAIN,RLE and that the dictionary page offset is zero (DO:0). I expected the column encodings to include RLE_DICTIONARY. The script with repro steps and the files it generated are attached.

      Below is the output of parquet-tools meta on the version 2.0 files:

      version='2.0', use_dictionary = True
      % parquet-tools meta example_v2.0_dict_True.parquet
      file:              file:.../example_v2.0_dict_True.parquet
      creator:           parquet-cpp version 1.5.1-SNAPSHOT
       
      file schema:       schema
      --------------------------------------------------------------------------------
      one:               OPTIONAL DOUBLE R:0 D:1
      three:             OPTIONAL BOOLEAN R:0 D:1
      two:               OPTIONAL BINARY R:0 D:1
      __index_level_0__: OPTIONAL BINARY R:0 D:1
       
      row group 1:       RC:3 TS:211 OFFSET:4
      --------------------------------------------------------------------------------
      one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
      three:              BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
      two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
      __index_level_0__:  BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]

      version='2.0', use_dictionary = False
      % parquet-tools meta example_v2.0_dict_False.parquet
      file:              file:.../example_v2.0_dict_False.parquet
      creator:           parquet-cpp version 1.5.1-SNAPSHOT
       
      file schema:       schema
      --------------------------------------------------------------------------------
      one:               OPTIONAL DOUBLE R:0 D:1
      three:             OPTIONAL BOOLEAN R:0 D:1
      two:               OPTIONAL BINARY R:0 D:1
      __index_level_0__: OPTIONAL BINARY R:0 D:1
       
      row group 1:       RC:3 TS:211 OFFSET:4
      --------------------------------------------------------------------------------
      one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
      three:              BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
      two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
      __index_level_0__:  BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
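
      The manual check described above (reading the DO: and ENC: fields for each column) could be automated with a small parser. The following is only a sketch, not part of the attached repro script: the names COLUMN_LINE, column_encodings and uses_dictionary are hypothetical, and it assumes the column-chunk line layout shown in the output above. It treats any encoding whose name ends in DICTIONARY (PLAIN_DICTIONARY or RLE_DICTIONARY) as evidence of dictionary encoding.

```python
import re

# Matches a column-chunk line from `parquet-tools meta`, e.g.
#   one:  DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[...]
# Assumption: the layout is "<name>: <type> <codec> DO:<n> FPO:<n> ... ENC:<list>".
COLUMN_LINE = re.compile(
    r"^\s*(?P<name>\S+):\s+\S+\s+\S+\s+DO:(?P<dict_offset>\d+)\s+FPO:\d+"
    r".*?\bENC:(?P<encodings>[A-Z_,]+)"
)


def column_encodings(meta_text):
    """Map column name -> (dictionary page offset, list of encodings)."""
    result = {}
    for line in meta_text.splitlines():
        match = COLUMN_LINE.match(line)
        if match:
            result[match["name"]] = (
                int(match["dict_offset"]),
                match["encodings"].split(","),
            )
    return result


def uses_dictionary(encodings):
    """True if any reported encoding is a dictionary encoding."""
    return any(enc.endswith("DICTIONARY") for enc in encodings)


# Two column lines copied from the output above, plus a row-group header
# that the parser should skip.
SAMPLE = """\
row group 1:       RC:3 TS:211 OFFSET:4
one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
"""

for name, (offset, encodings) in column_encodings(SAMPLE).items():
    print(name, offset, encodings, uses_dictionary(encodings))
```

      Running it over the saved parquet-tools output for each of the four attached files would make it easy to compare the version 1.0 and 2.0 results side by side.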

      Attachments

        1. pyarrow_repro.py (0.5 kB, Hatem Helal)
        2. example_v2.0_dict_True.parquet (1 kB, Hatem Helal)
        3. example_v2.0_dict_False.parquet (1 kB, Hatem Helal)
        4. example_v1.0_dict_True.parquet (1 kB, Hatem Helal)
        5. example_v1.0_dict_False.parquet (1 kB, Hatem Helal)

            People

              Hatem Helal (hatem)
              Votes: 0
              Watchers: 6

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 50m