It seems that values from different definition levels get mixed up by the Parquet reader when reading arrays of structs that contain nested structs, if the scalar values in the structs are of the same type and in the same position.
Consider the following schema:
If we populate a Parquet-based table using Hive with just this one column and a single row, we get the correct result when reading the values back using Hive. Impala, however, returns values from a higher definition level if they are stored in the same position and are of the same type. In the example below, we have populated the fields with values matching the field names for clarity, e.g. field c13 holds the integer 13.
Query in Hive:
Result from Hive (as expected):
Query in Impala:
Result from Impala (incorrect):
As can be seen above, the value of c22 (which should be 22) is 12; it is taken from c12, and both fields are int32. Likewise, the value of c24 (which should be 24) is 14; it is taken from c14, and both fields are int64. Note that the value of c23 is correct: it has a different type than the corresponding field at the higher definition level. The pattern seems to be that if a value of the same type exists in the same position at a higher definition level, the reader returns it instead of the correct value.
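Because each field was populated with the number in its own name, wrong values can be spotted mechanically. The following is only a rough sketch of such a check, not part of the reproduction itself: the helper name `find_mismatches` and the flat `observed` dictionary are made up for illustration, and the dictionary contains only the two wrong values reported above rather than the full Impala output.

```python
def find_mismatches(row):
    """Return (field, expected, actual) for each field whose value
    differs from the number encoded in its name (e.g. c22 -> 22)."""
    mismatches = []
    for field, actual in row.items():
        expected = int(field[1:])  # strip the leading "c"
        if actual != expected:
            mismatches.append((field, expected, actual))
    return mismatches


# Hypothetical flattened view of the two wrong values described above.
observed = {"c22": 12, "c24": 14}

for field, expected, actual in find_mismatches(observed):
    # By the naming convention, the wrong value names the field it came from.
    print(f"{field}: expected {expected}, got {actual} (value of c{actual})")
```

Only integer fields are covered here; a field of another type, such as c23, would need its own comparison.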
We have only seen this issue when the struct c11 is placed as the first field in the outer struct.
Attaching a sample Parquet file. The statement for creating a table on top of that file is: