  1. Apache Drill
  2. DRILL-4048

Parquet reader corrupts dictionary encoded binary columns

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.4.0
    • Component/s: Storage - Parquet
    • Labels: None

    Description

      git.commit.id.abbrev=04c01bd

      The query below returns corrupted data for binary columns (the corrupted values do not even render here):

      select * from `lineitem_dic_enc.parquet` limit 1;
      +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
      | l_orderkey  | l_partkey  | l_suppkey  | l_linenumber  | l_quantity  | l_extendedprice  | l_discount  | l_tax  | l_returnflag  | l_linestatus  | l_shipdate  | l_commitdate  | l_receiptdate  |   l_shipinstruct   | l_shipmode  |        l_comment         |
      +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
      | 1           | 1552       | 93         | 1             | 17.0        | 24710.35         | 0.04        | 0.02   |              |              | 1996-03-13  | 1996-02-12    | 1996-03-22     | DELIVER IN PE  | T       | egular courts above the  |
      +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
      

      The same query on an older build (git.commit.id.abbrev=839f8da) returns the expected values:

      select * from `lineitem_dic_enc.parquet` limit 1;
      +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
      | l_orderkey  | l_partkey  | l_suppkey  | l_linenumber  | l_quantity  | l_extendedprice  | l_discount  | l_tax  | l_returnflag  | l_linestatus  | l_shipdate  | l_commitdate  | l_receiptdate  |   l_shipinstruct   | l_shipmode  |        l_comment         |
      +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
      | 1           | 1552       | 93         | 1             | 17.0        | 24710.35         | 0.04        | 0.02   | N             | O             | 1996-03-13  | 1996-02-12    | 1996-03-22     | DELIVER IN PERSON  | TRUCK       | egular courts above the  |
      +-------------+------------+------------+---------------+-------------+------------------+-------------+--------+---------------+---------------+-------------+---------------+----------------+--------------------+-------------+--------------------------+
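
      To make the difference between the two results explicit, the rows can be compared cell by cell. This is only a minimal sketch: the header and row strings are copied from the outputs above as they render in this report, and the helper names (parse_row, differing_columns) are hypothetical, not part of Drill.

```python
# Header and result rows copied from the two query outputs in this report.
HEADER = "| l_orderkey  | l_partkey  | l_suppkey  | l_linenumber  | l_quantity  | l_extendedprice  | l_discount  | l_tax  | l_returnflag  | l_linestatus  | l_shipdate  | l_commitdate  | l_receiptdate  |   l_shipinstruct   | l_shipmode  |        l_comment         |"
# Row returned by the build with git.commit.id.abbrev=04c01bd.
CURRENT_ROW = "| 1           | 1552       | 93         | 1             | 17.0        | 24710.35         | 0.04        | 0.02   |              |              | 1996-03-13  | 1996-02-12    | 1996-03-22     | DELIVER IN PE  | T       | egular courts above the  |"
# Row returned by the older build with git.commit.id.abbrev=839f8da.
OLDER_ROW = "| 1           | 1552       | 93         | 1             | 17.0        | 24710.35         | 0.04        | 0.02   | N             | O             | 1996-03-13  | 1996-02-12    | 1996-03-22     | DELIVER IN PERSON  | TRUCK       | egular courts above the  |"


def parse_row(line):
    """Split one pipe-delimited result line into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def differing_columns(header, older, current):
    """Map each column whose value changed to its (older, current) pair."""
    names = parse_row(header)
    pairs = zip(names, parse_row(older), parse_row(current))
    return {name: (old, new) for name, old, new in pairs if old != new}


if __name__ == "__main__":
    changed = differing_columns(HEADER, OLDER_ROW, CURRENT_ROW)
    for name, (old, new) in changed.items():
        print(f"{name}: expected {old!r}, got {new!r}")
```

      On these two rows the comparison flags l_returnflag, l_linestatus, l_shipinstruct and l_shipmode; l_comment and all numeric and date columns are identical.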
      

      Below is the output of the parquet-meta command for this dataset:

      creator:         parquet-mr 
      
      file schema:     root 
      -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      l_orderkey:      REQUIRED INT32 R:0 D:0
      l_partkey:       REQUIRED INT32 R:0 D:0
      l_suppkey:       REQUIRED INT32 R:0 D:0
      l_linenumber:    REQUIRED INT32 R:0 D:0
      l_quantity:      REQUIRED DOUBLE R:0 D:0
      l_extendedprice: REQUIRED DOUBLE R:0 D:0
      l_discount:      REQUIRED DOUBLE R:0 D:0
      l_tax:           REQUIRED DOUBLE R:0 D:0
      l_returnflag:    REQUIRED BINARY O:UTF8 R:0 D:0
      l_linestatus:    REQUIRED BINARY O:UTF8 R:0 D:0
      l_shipdate:      REQUIRED INT32 O:DATE R:0 D:0
      l_commitdate:    REQUIRED INT32 O:DATE R:0 D:0
      l_receiptdate:   REQUIRED INT32 O:DATE R:0 D:0
      l_shipinstruct:  REQUIRED BINARY O:UTF8 R:0 D:0
      l_shipmode:      REQUIRED BINARY O:UTF8 R:0 D:0
      l_comment:       REQUIRED BINARY O:UTF8 R:0 D:0
      
      row group 1:     RC:60175 TS:3049610 
      -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      l_orderkey:       INT32 SNAPPY DO:0 FPO:4 SZ:146159/165487/1.13 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_partkey:        INT32 SNAPPY DO:0 FPO:146163 SZ:90867/90918/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_suppkey:        INT32 SNAPPY DO:0 FPO:237030 SZ:53244/53230/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_linenumber:     INT32 SNAPPY DO:0 FPO:290274 SZ:14909/22767/1.53 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_quantity:       DOUBLE SNAPPY DO:0 FPO:305183 SZ:45536/45715/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_extendedprice:  DOUBLE SNAPPY DO:0 FPO:350719 SZ:327454/407907/1.25 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_discount:       DOUBLE SNAPPY DO:0 FPO:678173 SZ:30349/30359/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_tax:            DOUBLE SNAPPY DO:0 FPO:708522 SZ:30334/30342/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_returnflag:     BINARY SNAPPY DO:0 FPO:738856 SZ:14700/14714/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_linestatus:     BINARY SNAPPY DO:0 FPO:753556 SZ:8964/9506/1.06 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_shipdate:       INT32 SNAPPY DO:0 FPO:762520 SZ:100537/100514/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_commitdate:     INT32 SNAPPY DO:0 FPO:863057 SZ:100314/100282/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_receiptdate:    INT32 SNAPPY DO:0 FPO:963371 SZ:100584/100558/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_shipinstruct:   BINARY SNAPPY DO:0 FPO:1063955 SZ:15311/15303/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_shipmode:       BINARY SNAPPY DO:0 FPO:1079266 SZ:22800/22797/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
      l_comment:        BINARY SNAPPY DO:0 FPO:1102066 SZ:795339/1839211/2.31 VC:60175 ENC:PLAIN,BIT_PACKED
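
      The row group listing above can be filtered for the columns named in the issue title, that is, BINARY columns whose encodings include PLAIN_DICTIONARY. The following is a rough sketch under stated assumptions: the sample lines are a subset copied from the dump above, the regular expression only targets this particular line layout, and the function name is hypothetical. It reads the metadata text only and says nothing about the root cause in the reader.

```python
import re

# A subset of the row group lines from the parquet-meta output in this report.
ROW_GROUP_LINES = """\
l_orderkey:       INT32 SNAPPY DO:0 FPO:4 SZ:146159/165487/1.13 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
l_returnflag:     BINARY SNAPPY DO:0 FPO:738856 SZ:14700/14714/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
l_linestatus:     BINARY SNAPPY DO:0 FPO:753556 SZ:8964/9506/1.06 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
l_shipinstruct:   BINARY SNAPPY DO:0 FPO:1063955 SZ:15311/15303/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
l_shipmode:       BINARY SNAPPY DO:0 FPO:1079266 SZ:22800/22797/1.00 VC:60175 ENC:BIT_PACKED,PLAIN_DICTIONARY
l_comment:        BINARY SNAPPY DO:0 FPO:1102066 SZ:795339/1839211/2.31 VC:60175 ENC:PLAIN,BIT_PACKED
"""

# Captures: column name, physical type, comma-separated encoding list.
# Assumes the "<name>: <type> <codec> ... ENC:<list>" layout shown above.
COLUMN_LINE = re.compile(r"^\s*(\w+):\s+(\w+)\s+\w+\s+.*\bENC:(\S+)\s*$")


def dictionary_encoded_binary_columns(text):
    """Return names of BINARY columns that list PLAIN_DICTIONARY as an encoding."""
    columns = []
    for line in text.splitlines():
        match = COLUMN_LINE.match(line)
        if not match:
            continue  # schema lines, separators and headers carry no ENC field
        name, physical_type, encodings = match.groups()
        if physical_type == "BINARY" and "PLAIN_DICTIONARY" in encodings.split(","):
            columns.append(name)
    return columns


if __name__ == "__main__":
    print(dictionary_encoded_binary_columns(ROW_GROUP_LINES))
```

      For this dataset the filter selects l_returnflag, l_linestatus, l_shipinstruct and l_shipmode, while l_comment (PLAIN encoded) is excluded, which is consistent with l_comment being the only binary column that reads back correctly in the failing query.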
      

      Attachments

        1. lineitem_dic_enc.parquet
          1.81 MB
          Rahul Kumar Challapalli

          People

            Assignee: Jason Altekruse (jaltekruse)
            Reporter: Rahul Kumar Challapalli (rkins)