IMPALA-4725: Wrong field resolution of nested Parquet fields

    Details

    • Docs Text:
      This needs to be documented once it has been fixed.
    • Target Version:
      Impala 2.9

      Description

      It seems like values from different definition levels get mixed up by the Parquet reader when reading arrays of structs that contain nested structs, if the scalar values in the structs have the same type and the same position.

      Consider the following schema:

      message hive_schema {
        optional group array_column (LIST) {
          repeated group bag {
            optional group array_element {
              optional group c11 {
                optional int32 c21;
                optional int32 c22;
                optional int32 c23;
                optional int64 c24;
              }
              optional int32 c12;
              optional int64 c13;
              optional int64 c14;
            }
          }
        }
      }
      

      If we populate a Parquet-based table using Hive with just this one column and a single row, we get the correct result when reading the values back using Hive. Impala, however, returns values from a higher definition level if they are stored at the same position and have the same type. In the example below we have populated the fields with values matching the field names for clarity, e.g. field c13 holds the integer 13.

      Query in Hive:

      SELECT
        array_column[0].c11.c21,
        array_column[0].c11.c22,
        array_column[0].c11.c23,
        array_column[0].c11.c24,
        array_column[0].c12,
        array_column[0].c13,
        array_column[0].c14
      FROM sample;
      

      Result from Hive (as expected):

      c21 c22 c23 c24 c12 c13 c14
      21  22  23  24  12  13  14
      

      Query in Impala:

      SELECT
        c11.c21 AS c21,
        c11.c22 AS c22,
        c11.c23 AS c23,
        c11.c24 AS c24,
        c12,
        c13,
        c14
      FROM sample.array_column;
      

      Result from Impala (incorrect):

      c21 c22 c23 c24 c12 c13 c14
      21  12  23  14  12  13  14
      

      As can be seen above, the value of c22 (which should be 22) is 12; it is taken from c12, and both fields are int32. Likewise, the value of c24 (which should be 24) is 14, taken from c14; both fields are int64. Note that the value of c23 is correct: it has a different type than the corresponding field at the higher definition level. The pattern seems to be that if there is a value of the same type at the same position at a higher definition level, the reader prefers it over the correct value.

      We have only seen this issue when the struct c11 is placed as the first field in the outer struct.

      Attaching a sample Parquet file. The statement for creating a table on top of that file is:

      CREATE EXTERNAL TABLE sample (
        array_column array<struct<c11:struct<c21:int,c22:int,c23:int,c24:bigint>,c12:int,c13:bigint,c14:bigint>>
      )
      STORED AS PARQUET LOCATION "/path/to/sample";
      
      1. sample.parq
        1 kB
        Petter von Dolwitz

        Activity

        jbapple Jim Apple added a comment -

        Petter, this is no longer a Cloudera project. The fact that the issue tracker is on cloudera.org is a temporary situation that will be remedied ASAP. Impala is an Apache incubating project.

        Pettax Petter von Dolwitz added a comment -

        Many thanks for your efforts on this one!

        Can I take the opportunity to ask (this still being a Cloudera project) whether you think this patch is a candidate for inclusion in future maintenance releases of already released Impala versions, i.e. the Impala versions bundled with CDH 5.9.x and CDH 5.10.x, or whether it will only be included in 2.9.0 and onwards?

        alex.behm Alexander Behm added a comment -

        commit 7d8acee81437c040255c3219bcd3e72f10b3b772
        Author: Alex Behm <alex.behm@cloudera.com>
        Date: Fri Feb 24 19:04:59 2017 -0800

        IMPALA-4725: Query option to control Parquet array resolution.

        Summary of changes:
        Introduces a new query option PARQUET_ARRAY_RESOLUTION to
        control the path-resolution behavior for Parquet files
        with nested arrays. The values are:

        • THREE_LEVEL
          Assumes arrays are encoded with the 3-level representation.
          Also resolves arrays encoded with a single level.
          Does not attempt a 2-level resolution.
        • TWO_LEVEL
          Assumes arrays are encoded with the 2-level representation.
          Also resolves arrays encoded with a single level.
          Does not attempt a 3-level resolution.
        • TWO_LEVEL_THEN_THREE_LEVEL
          First tries to resolve assuming the 2-level representation,
          and if unsuccessful, tries the 3-level representation.
          Also resolves arrays encoded with a single level.
          This is the current Impala behavior and is used as the
          default value for compatibility.

        Note that 'failure' to resolve a schema path with a given
        array-resolution policy does not necessarily mean a warning or
        error is returned by the query. A mismatch might be treated
        like a missing field which is necessary to support schema
        evolution. There is no way to reliably distinguish the
        'bad resolution' and 'legitimately missing field' cases.

        The new query option is independent of and can be combined
        with the existing PARQUET_FALLBACK_SCHEMA_RESOLUTION.
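
        As an aside (not part of the commit message itself), here is a minimal impala-shell sketch of how the two options might be combined for the sample table from the description. The option names and values are taken from this commit; the POSITION value of PARQUET_FALLBACK_SCHEMA_RESOLUTION is assumed from existing Impala behavior.

        SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=POSITION;  -- keep index-based field resolution
        SET PARQUET_ARRAY_RESOLUTION=THREE_LEVEL;         -- assume the standard 3-level array layout
        SELECT c11.c22 AS c22, c12 FROM sample.array_column;

        With THREE_LEVEL forced, the 2-level compatibility attempt is skipped, so the ambiguity described in this issue cannot pick the wrong column.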

        Background:
        Arrays can be represented in several ways in Parquet:

        • Three Level Encoding (standard)
        • Two Level Encoding (legacy)
        • One Level Encoding (legacy)

        More details are in the "Lists" section of the spec:
        https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

        Unfortunately, there is no reliable metadata within Parquet files
        to indicate which encoding was used. There is even the possibility
        of having mixed encodings within the same file if there are multiple
        arrays.
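
        For illustration (a sketch based on the spec's backward-compatibility rules, not part of the commit message), the same array-of-int column could appear in any of these shapes; the group and field names below are only examples:

        3-level (standard):
          optional group my_array (LIST) {
            repeated group list {
              optional int32 element;
            }
          }

        2-level (legacy):
          optional group my_array (LIST) {
            repeated int32 element;
          }

        1-level (legacy):
          repeated int32 my_array;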

        As a result, Impala currently tries to auto-detect the file encoding
        when resolving a schema path in a Parquet file using the
        TWO_LEVEL_THEN_THREE_LEVEL policy.

        However, regardless of whether a Parquet data file uses the 2-level
        or 3-level encoding, the index-based resolution may return incorrect
        results if the representation in the Parquet file does not
        exactly match the attempted array-resolution policy. Intuitively,
        when attempting a 2-level resolution on a 3-level file, the matched
        schema node may not be deep enough in the schema tree, but could still
        be a scalar node with the expected type. Similarly, when attempting a
        3-level resolution on a 2-level file, a level may be incorrectly
        skipped.

        The name-based policy generally does not have this problem because it
        avoids traversing incorrect schema paths. However, the index-based
        resolution allows a different set of schema-evolution operations,
        so just using name-based resolution is not an acceptable workaround
        in all cases.

        Testing:

        • Added new Parquet data files that show how incorrect results
          can be returned with a mismatched file encoding and resolution
          policy. Added both 2-level and 3-level versions of the data.
        • Added a new test in test_nested_types.py that shows the behavior
          with the new PARQUET_ARRAY_RESOLUTION query option.
        • Locally ran test_scanners.py and test_nested_types.py on core.

        Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907
        Reviewed-on: http://gerrit.cloudera.org:8080/6250
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Impala Public Jenkins

        jbapple Jim Apple added a comment -

        Alex has a +2'd but merge-conflicted patch here: https://gerrit.cloudera.org/#/c/6250/

        alex.behm Alexander Behm added a comment -

        Thinking about this a little more, I don't think we should support the 3_THEN_2 variant. Incorrect results can be returned for files with mixed encodings, regardless of whether a 2-level or 3-level resolution is attempted. Having such mixed files is a really bad idea and we should not add another query option to encourage/support these bad files. We need to preserve 2_THEN_3 because that's Impala's current default, but we should seriously consider retiring that as well in the distant future.

        lv Lars Volker added a comment -

        Greg Rahn, Alexander Behm, and I discussed this in person. We agreed to add a query option to control the resolution order when using indexed field resolution. The possible strategies would be 2-level resolution and 3-level resolution, each with or without falling back to the other in case of failure:

        • 2_ONLY
        • 3_ONLY
        • 2_THEN_3 (This would be the current behavior)
        • 3_THEN_2

        Parquet specifies 3-level resolution as the default, so we should switch the default to 3_ONLY or 3_THEN_2 upon the next compatibility-breaking release.
        We also came to the conclusion that this behavior should be well documented, so that it is easy to detect and work around.

        lv Lars Volker added a comment -

        Good idea, done.

        jbapple Jim Apple added a comment -

        Do you want to mark that a P1 blocker, too?

        lv Lars Volker added a comment -

        Jim Apple - I started looking into this and talked to Alexander Behm about it. We figured that we will try to fix IMPALA-4675 first as a workaround for this issue.

        Alexander Behm - After looking into the issue here, why do we try 2-level resolution first instead of 3-level resolution? Parquet seems to suggest that the latter is the default and should be tried first. I also couldn't figure out why both work in this case; it is on my list to step through the code with a debugger and see why 2-level resolution yields a positive result.

        jbapple Jim Apple added a comment -

        Lars Volker, this has a target version of 2.9. How is the investigation going?

        alex.behm Alexander Behm added a comment -

        Thanks for this fantastic report, Petter von Dolwitz.

        The bug is in the Parquet field-resolution logic and not in the scanning/interpretation of the definition levels. The bug stems from the fact that there are multiple ways to represent an array<struct<>> in Parquet: a 2-level or a 3-level representation. See the "Backward-compatibility rules" in:
        https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

        The problematic code is in ParquetSchemaResolver::ResolvePath() of parquet-metadata-utils.cc. We first try resolving according to the 2-level representation and then try the 3-level representation. Unfortunately, for some schemas there is ambiguity between the 2-level and 3-level representations. The provided schema uses the standard 3-level representation, but some field references resolve correctly assuming a 2-level representation, which means the wrong column values will be returned.

        I verified that Impala returns correct results when disabling the 2-level compatibility resolution.
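
        To make the ambiguity concrete (this walk-through is reconstructed from the schema and results in the description, not taken from the code or the commit), consider the table path c11.c22, which resolves by index as [0, 1]:

        3-level reading (correct):
          array_column (LIST) -> bag is only the repetition level -> array_element is the struct
          -> field 0 = c11 -> field 1 = c22 (int32) = 22

        2-level reading (attempted first, wrongly accepted):
          array_column (LIST) -> bag itself is taken as the element struct
          -> field 0 = array_element, matched as "c11" -> field 1 = c12 (int32) = 12

        Because c12 has the same index and type as c22, the 2-level attempt "succeeds" and the wrong column is read; the same happens for c24/c14 (both int64). For c21 and c23 the 2-level attempt hits a group and a bigint respectively, fails, and the 3-level attempt then returns the correct values. This matches the reported values 21, 12, 23, 14 for c21 through c24.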

        tarmstrong Tim Armstrong added a comment -

        This definitely looks like a bug; we should aim to fix it in the next release. Thanks for providing such a clear test case.


          People

          • Assignee:
            alex.behm Alexander Behm
          • Reporter:
            Pettax Petter von Dolwitz
