Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
0.7.0
-
None
Description
There are several issues related to DECIMAL columns in parquet files, depending on which reader is used "simple reader" or "complex reader" and if the columns are OPTIONAL or REQUIRED.
I have the following test.json file:
{ "a": "1" } { "a": "1" } { "a": "1" }
I created a parquet file using the following query:
CREATE TABLE dfs.tmp.`test_all_decimal` AS SELECT CAST(a AS DECIMAL(9,6)) decimal9_opt, CAST('1' AS DECIMAL(9,6)) decimal9_req, CAST(a AS DECIMAL(18, 8)) decimal18_opt, CAST('1' AS DECIMAL(18, 8)) decimal18_req, CAST(a AS DECIMAL(28,10)) decimal28_col, CAST('1' AS DECIMAL(28,10)) decimal28_req, CAST(a AS DECIMAL(38,10)) decimal38_col, CAST('1' AS DECIMAL(38,10)) decimal38_req FROM dfs.data.`test_char.json`;
This creates a parquet file with the following metadata, retrieved using parquet tools:
creator: parquet-mr file schema: root ----------------------------------------------------------------------------------------------------- decimal9_opt: OPTIONAL INT32 O:DECIMAL R:0 D:1 decimal9_req: REQUIRED INT32 O:DECIMAL R:0 D:0 decimal18_opt: OPTIONAL INT64 O:DECIMAL R:0 D:1 decimal18_req: REQUIRED INT64 O:DECIMAL R:0 D:0 decimal28_col: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 decimal28_req: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0 decimal38_col: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 decimal38_req: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0 row group 1: RC:3 TS:636 ----------------------------------------------------------------------------------------------------- decimal9_opt: INT32 SNAPPY DO:0 FPO:4 SZ:62/58/0.94 VC:3 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED decimal9_req: INT32 SNAPPY DO:0 FPO:66 SZ:56/52/0.93 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED decimal18_opt: INT64 SNAPPY DO:0 FPO:122 SZ:74/70/0.95 VC:3 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED decimal18_req: INT64 SNAPPY DO:0 FPO:196 SZ:68/64/0.94 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED decimal28_col: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:264 SZ:72/91/1.26 VC:3 ENC:RLE,BIT_PACKED,PLAIN decimal28_req: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:336 SZ:66/85/1.29 VC:3 ENC:BIT_PACKED,PLAIN decimal38_col: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:402 SZ:80/111/1.39 VC:3 ENC:RLE,BIT_PACKED,PLAIN decimal38_req: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:482 SZ:77/105/1.36 VC:3 ENC:BIT_PACKED,PLAIN
If we disable dictionary encoding:
alter session set `store.parquet.enable_dictionary_encoding` = false;
We will get a simlar parquet file but with PLAIN encoding instead of PLAIN_DICTIONARY for DECIMAL9 and DECIMAL18 columns.
When using the "simple" parquet reader, with dictionary encoding enabled,
The following query returns wrong results for DECIMAL28/REQUIRED and DECIMAL38/REQUIRED (we can't read DECIMAL9 nor DECIMAL18 columns because of DRILL-2262):
select decimal28_col, decimal28_req, decimal38_col, decimal38_req from dfs.tmp.`test_all_decimal`; +---------------+---------------+---------------+---------------+ | decimal28_opt | decimal28_req | decimal38_opt | decimal38_req | +---------------+---------------+---------------+---------------+ | 1.0000000000 | 100000000.0000000000 | 1.0000000000 | 100000000.0000000000 | | 1.0000000000 | 100000000.0000000000 | 1.0000000000 | 100000000.0000000000 | | 1.0000000000 | 100000000.0000000000 | 1.0000000000 | 100000000.0000000000 | +---------------+---------------+---------------+---------------+
When dictionary encoding is disabled, the following query eturns wrong results for DECIMAL28/REQUIRED and DECIMAL38/REQUIRED:
select decimal9_opt, decimal9_req, decimal18_opt, decimal18_req, decimal28_opt, decimal28_req, decimal38_opt, decimal38_req from dfs.tmp.`test_all_decimal_nodictionary`; +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+ | decimal9_opt | decimal9_req | decimal18_opt | decimal18_req | decimal28_opt | decimal28_req | decimal38_opt | decimal38_req | +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+ | 1.000000 | 1.000000 | 1.00000000 | 1.00000000 | 1.0000000000 | 100000000.0000000000 | 1.0000000000 | 100000000.0000000000 | | 1.000000 | 1.000000 | 1.00000000 | 1.00000000 | 1.0000000000 | 100000000.0000000000 | 1.0000000000 | 100000000.0000000000 | | 1.000000 | 1.000000 | 1.00000000 | 1.00000000 | 1.0000000000 | 100000000.0000000000 | 1.0000000000 | 100000000.0000000000 | +--------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+
When using the "complex" reader:
alter session set `store.parquet.use_new_reader` = true;
We can't read DECIMAL28 nor DECIMAL38 because of DRILL-2220.
the following query gives wrong results for {DECIMAL9}} and DECIMAL18 no matter if dictionary encoding is enabled or disabled:
select decimal9_opt, decimal9_req, decimal18_opt, decimal18_req from dfs.tmp.`test_all_decimal`; +--------------+--------------+---------------+---------------+ | decimal9_opt | decimal9_req | decimal18_opt | decimal18_req | +--------------+--------------+---------------+---------------+ | 1000000 | 1000000 | 100000000 | 100000000 | | 1000000 | 1000000 | 100000000 | 100000000 | | 1000000 | 1000000 | 100000000 | 100000000 | +--------------+--------------+---------------+---------------+
Attachments
Issue Links
- is duplicated by
-
DRILL-2220 Complex reader unable to read FIXED_LEN_BYTE_ARRAY types in parquet file
- Resolved
-
DRILL-2338 DECIMAL 38 when stored as a required datatype results in wrong data
- Resolved