Details
- Bug
- Status: Resolved
- Blocker
- Resolution: Fixed
- Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
- CDH 5.8.3
Description
It seems that the Parquet reader mixes up values from different definition levels when reading arrays of structs that contain nested structs, if the scalar values in the structs are of the same type and sit at the same position.
Consider the following schema:
message hive_schema {
  optional group array_column (LIST) {
    repeated group bag {
      optional group array_element {
        optional group c11 {
          optional int32 c21;
          optional int32 c22;
          optional int32 c23;
          optional int64 c24;
        }
        optional int32 c12;
        optional int64 c13;
        optional int64 c14;
      }
    }
  }
}
If we use Hive to populate a Parquet-based table with just this one column and a single row, Hive returns the correct values when reading them back. Impala, however, returns values from a higher definition level if they are stored at the same position and are of the same type. In the example below, each field is populated with the value matching its name for clarity, e.g. field c13 holds the integer 13.
Query in Hive:
SELECT array_column[0].c11.c21,
       array_column[0].c11.c22,
       array_column[0].c11.c23,
       array_column[0].c11.c24,
       array_column[0].c12,
       array_column[0].c13,
       array_column[0].c14
FROM sample;
Result from Hive (as expected):
c21  c22  c23  c24  c12  c13  c14
21   22   23   24   12   13   14
Query in Impala:
SELECT c11.c21 AS c21,
       c11.c22 AS c22,
       c11.c23 AS c23,
       c11.c24 AS c24,
       c12,
       c13,
       c14
FROM sample.array_column;
Result from Impala (incorrect):
c21  c22  c23  c24  c12  c13  c14
21   12   23   14   12   13   14
As can be seen above, the value of c22 (which should be 22) is 12; it is taken from c12, and both fields are int32. Likewise, the value of c24 (which should be 24) is 14; it is taken from c14, and both fields are int64. Note that the value of c23 is correct: c23 is int32, whereas c13, the field at the same position one level up, is int64. The pattern seems to be that if a field of the same type exists at the same position at a higher definition level, the reader prefers its value over the correct one.
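The suspected pattern can be written down as a small Python sketch. This is only a toy model of the hypothesis in this report, not Impala's reader logic; the names predict_shadowed, simulate_read, OUTER, INNER and VALUES are made up for illustration. The field lists are copied from the schema above, and each field is assumed to hold the number in its name, as in the sample file.

```python
# Toy model of the pattern hypothesized in this report.
# It is NOT how Impala's Parquet reader is implemented.

# (name, type) pairs in declaration order, copied from the schema above.
OUTER = [("c11", "struct"), ("c12", "int32"), ("c13", "int64"), ("c14", "int64")]
INNER = [("c21", "int32"), ("c22", "int32"), ("c23", "int32"), ("c24", "int64")]

# Each field holds the number in its name, as in the sample file.
VALUES = {"c21": 21, "c22": 22, "c23": 23, "c24": 24, "c12": 12, "c13": 13, "c14": 14}


def predict_shadowed(inner, outer):
    """Map each nested field to the outer field at the same position
    when both have the same type (the suspected collision)."""
    shadowed = {}
    for (inner_name, inner_type), (outer_name, outer_type) in zip(inner, outer):
        if inner_type == outer_type:
            shadowed[inner_name] = outer_name
    return shadowed


def simulate_read(values, inner, outer):
    """Return the nested-field values a reader with this defect would report."""
    shadowed = predict_shadowed(inner, outer)
    return {name: values[shadowed.get(name, name)] for name, _ in inner}


print(predict_shadowed(INNER, OUTER))
print(simulate_read(VALUES, INNER, OUTER))
```

Under this model only c22 and c24 collide with an outer field, and the simulated read gives 21, 12, 23, 14 for c21 to c24, which matches the incorrect Impala result above.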
We have only seen this issue when the struct c11 is placed as the first field in the outer struct.
A sample Parquet file is attached. The statement for creating a table on top of that file is:
CREATE EXTERNAL TABLE sample (
  array_column array<struct<
    c11:struct<c21:int,c22:int,c23:int,c24:bigint>,
    c12:int,
    c13:bigint,
    c14:bigint
  >>
)
STORED AS PARQUET
LOCATION "/path/to/sample";
Attachments
Issue Links
- is related to IMPALA-6240 PARQUET_ARRAY_RESOLUTION query option not documented (Closed)