- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.7.0, 1.8.0, 1.8.1, 1.9.0
- Component/s: parquet-avro
- Labels: None
Found this issue while investigating SPARK-16344.
For the following Parquet schema
    message root {
      optional group f (LIST) {
        repeated group list {
          optional group element {
            optional int64 element;
          }
        }
      }
    }
parquet-avro decodes it as something like this:
    record SingleElement {
      int element;
    }
    record NestedSingleElement {
      SingleElement element;
    }
    record Spark16344Wrong {
      array<NestedSingleElement> f;
    }
while the correct interpretation should be:
    record SingleElement {
      int element;
    }
    record Spark16344 {
      array<SingleElement> f;
    }
The reason is that the element syntactic group for LIST in
    <list-repetition> group <name> (LIST) {
      repeated group list {
        <element-repetition> <element-type> element;
      }
    }
is recognized as a record field named element. The problematic code lies in AvroRecordConverter.isElementType(). We should probably check the standard 3-level layout first before falling back to the legacy 2-level layout.
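The proposed ordering can be illustrated with a small, self-contained sketch. It is not the code in AvroRecordConverter and not the fix that was eventually committed. Node, ListLayoutSketch, and all method names are hypothetical stand-ins, and the legacy rule shown here is a simplified assumption about how a name-based heuristic could produce the wrong result for the schema above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a Parquet schema node.
// This is NOT org.apache.parquet.schema.Type.
final class Node {
    final String name;
    final boolean repeated;
    final List<Node> fields; // empty for primitive nodes

    Node(String name, boolean repeated, Node... fields) {
        this.name = name;
        this.repeated = repeated;
        this.fields = Arrays.asList(fields);
    }

    boolean isPrimitive() {
        return fields.isEmpty();
    }
}

final class ListLayoutSketch {
    // Standard 3-level layout: a repeated group named "list" that wraps
    // exactly one non-repeated field named "element".
    static boolean isStandardThreeLevel(Node repeated) {
        return !repeated.isPrimitive()
            && repeated.name.equals("list")
            && repeated.fields.size() == 1
            && repeated.fields.get(0).name.equals("element")
            && !repeated.fields.get(0).repeated;
    }

    // Assumed, simplified legacy rule: the repeated node is itself the
    // element if it is primitive, has several fields, or if its only field
    // has the same name as the only field of the expected element record.
    static boolean legacySaysRepeatedIsElement(Node repeated, String soleExpectedField) {
        if (repeated.isPrimitive() || repeated.fields.size() > 1) {
            return true;
        }
        return repeated.fields.get(0).name.equals(soleExpectedField);
    }

    // Returns true when the repeated node should be read as the element type
    // (2-level layout), false when it is only a wrapper (3-level layout).
    static boolean repeatedIsElement(Node repeated, String soleExpectedField,
                                     boolean checkStandardFirst) {
        if (checkStandardFirst && isStandardThreeLevel(repeated)) {
            return false;
        }
        return legacySaysRepeatedIsElement(repeated, soleExpectedField);
    }

    public static void main(String[] args) {
        // The repeated group from the schema in this report:
        // list { element { element } }
        Node list = new Node("list", true,
            new Node("element", false, new Node("element", false)));
        System.out.println(repeatedIsElement(list, "element", false));
        System.out.println(repeatedIsElement(list, "element", true));
    }
}
```

With the legacy rule alone, the name match on element makes the wrapper group look like the element type, which corresponds to the NestedSingleElement result above. Checking the standard layout first avoids that, at the cost of misreading any legacy 2-level file whose repeated group happens to be named list with a single element field.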
- is caused by: PARQUET-1681 Avro's isElementType() change breaks the reading of some parquet(1.8.1) files (Open)
- is related to: SPARK-16344 Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+ (Resolved)
- links to