Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-2300

problems reading DECIMAL from parquet file

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 0.7.0
    • Future
    • Storage - Parquet
    • None

    Description

      There are several issues related to DECIMAL columns in parquet files, depending on which reader is used "simple reader" or "complex reader" and if the columns are OPTIONAL or REQUIRED.

      I have the following test.json file:

      { "a": "1" }
      { "a": "1" }
      { "a": "1" }
      

      I created a parquet file using the following query:

      CREATE TABLE dfs.tmp.`test_all_decimal` AS 
        SELECT 
          CAST(a AS DECIMAL(9,6)) decimal9_opt, 
          CAST('1' AS DECIMAL(9,6)) decimal9_req, 
          CAST(a AS DECIMAL(18, 8)) decimal18_opt, 
          CAST('1' AS DECIMAL(18, 8)) decimal18_req, 
          CAST(a AS DECIMAL(28,10)) decimal28_col, 
          CAST('1' AS DECIMAL(28,10)) decimal28_req, 
          CAST(a AS DECIMAL(38,10)) decimal38_col, 
          CAST('1' AS DECIMAL(38,10)) decimal38_req 
        FROM dfs.data.`test_char.json`;
      

      This creates a parquet file with the following metadata, retrieved using parquet tools:

      creator:       parquet-mr 
      
      file schema:   root 
      -----------------------------------------------------------------------------------------------------
      decimal9_opt:  OPTIONAL INT32 O:DECIMAL R:0 D:1
      decimal9_req:  REQUIRED INT32 O:DECIMAL R:0 D:0
      decimal18_opt: OPTIONAL INT64 O:DECIMAL R:0 D:1
      decimal18_req: REQUIRED INT64 O:DECIMAL R:0 D:0
      decimal28_col: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
      decimal28_req: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
      decimal38_col: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
      decimal38_req: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
      
      row group 1:   RC:3 TS:636 
      -----------------------------------------------------------------------------------------------------
      decimal9_opt:   INT32 SNAPPY DO:0 FPO:4 SZ:62/58/0.94 VC:3 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
      decimal9_req:   INT32 SNAPPY DO:0 FPO:66 SZ:56/52/0.93 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED
      decimal18_opt:  INT64 SNAPPY DO:0 FPO:122 SZ:74/70/0.95 VC:3 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
      decimal18_req:  INT64 SNAPPY DO:0 FPO:196 SZ:68/64/0.94 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED
      decimal28_col:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:264 SZ:72/91/1.26 VC:3 ENC:RLE,BIT_PACKED,PLAIN
      decimal28_req:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:336 SZ:66/85/1.29 VC:3 ENC:BIT_PACKED,PLAIN
      decimal38_col:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:402 SZ:80/111/1.39 VC:3 ENC:RLE,BIT_PACKED,PLAIN
      decimal38_req:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:482 SZ:77/105/1.36 VC:3 ENC:BIT_PACKED,PLAIN
      

      If we disable dictionary encoding:

      alter session set `store.parquet.enable_dictionary_encoding` = false;
      

      We will get a simlar parquet file but with PLAIN encoding instead of PLAIN_DICTIONARY for DECIMAL9 and DECIMAL18 columns.

      When using the "simple" parquet reader, with dictionary encoding enabled,
      The following query returns wrong results for DECIMAL28/REQUIRED and DECIMAL38/REQUIRED (we can't read DECIMAL9 nor DECIMAL18 columns because of DRILL-2262):

      select decimal28_col, decimal28_req, decimal38_col, decimal38_req from dfs.tmp.`test_all_decimal`;
      +---------------+---------------+---------------+---------------+
      | decimal28_opt | decimal28_req | decimal38_opt | decimal38_req |
      +---------------+---------------+---------------+---------------+
      | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
      | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
      | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
      +---------------+---------------+---------------+---------------+
      

      When dictionary encoding is disabled, the following query eturns wrong results for DECIMAL28/REQUIRED and DECIMAL38/REQUIRED:

      select decimal9_opt, decimal9_req, decimal18_opt, decimal18_req, decimal28_opt, decimal28_req, decimal38_opt, decimal38_req from dfs.tmp.`test_all_decimal_nodictionary`;
      +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
      | decimal9_opt | decimal9_req | decimal18_opt | decimal18_req | decimal28_opt | decimal28_req | decimal38_opt | decimal38_req |
      +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
      | 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
      | 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
      | 1.000000     | 1.000000     | 1.00000000    | 1.00000000    | 1.0000000000  | 100000000.0000000000 | 1.0000000000  | 100000000.0000000000 |
      +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
      

      When using the "complex" reader:

      alter session set `store.parquet.use_new_reader` = true;
      

      We can't read DECIMAL28 nor DECIMAL38 because of DRILL-2220.

      the following query gives wrong results for {DECIMAL9}} and DECIMAL18 no matter if dictionary encoding is enabled or disabled:

      select decimal9_opt, decimal9_req, decimal18_opt, decimal18_req from dfs.tmp.`test_all_decimal`;
      +--------------+--------------+---------------+---------------+
      | decimal9_opt | decimal9_req | decimal18_opt | decimal18_req |
      +--------------+--------------+---------------+---------------+
      | 1000000      | 1000000      | 100000000     | 100000000     |
      | 1000000      | 1000000      | 100000000     | 100000000     |
      | 1000000      | 1000000      | 100000000     | 100000000     |
      +--------------+--------------+---------------+---------------+
      

      Attachments

        Issue Links

          Activity

            People

              sphillips Steven Phillips
              adeneche Abdel Hakim Deneche
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: