Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
1.7.0, 1.8.0, 1.8.1, 1.9.0
None
Description
Found this issue while investigating SPARK-16344.
For the following Parquet schema
message root {
  optional group f (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
parquet-avro decodes it as something like this:
record SingleElement {
  int element;
}
record NestedSingleElement {
  SingleElement element;
}
record Spark16344Wrong {
  array<NestedSingleElement> f;
}
while the correct interpretation should be:
record SingleElement {
  int element;
}
record Spark16344 {
  array<SingleElement> f;
}
The reason is that the syntactic element group of the standard LIST layout
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
is mistaken for a record field named element, so the repeated group list is itself treated as the element type. The problematic code is in AvroRecordConverter.isElementType(). We should probably check for the standard 3-level layout first, before falling back to the legacy 2-level layout.
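The proposed ordering can be illustrated with a small toy model. This is only a sketch and not the parquet-avro implementation: the Primitive and Group classes, the function names, and the name-matching legacy heuristic are all hypothetical stand-ins, chosen to reproduce the ambiguity described above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Primitive:
    name: str
    repetition: str  # "required", "optional", or "repeated"
    type_name: str


@dataclass
class Group:
    name: str
    repetition: str
    fields: List["Node"] = field(default_factory=list)


Node = Union[Primitive, Group]


def is_standard_three_level(repeated: Node) -> bool:
    """True if `repeated` looks like `repeated group list { <type> element; }`."""
    return (
        isinstance(repeated, Group)
        and repeated.repetition == "repeated"
        and repeated.name == "list"
        and len(repeated.fields) == 1
        and repeated.fields[0].name == "element"
        and repeated.fields[0].repetition != "repeated"
    )


def legacy_says_repeated_is_element(repeated: Node, expected_fields: List[str]) -> bool:
    """Hypothetical legacy 2-level heuristic: the repeated node is the element
    if it is primitive, has several fields, or its field names match the
    record the reader expects."""
    if isinstance(repeated, Primitive) or len(repeated.fields) > 1:
        return True
    return [f.name for f in repeated.fields] == expected_fields


def element_of(repeated: Node, expected_fields: List[str], standard_first: bool) -> Node:
    """Pick the node that should be decoded as the list element."""
    if standard_first and is_standard_three_level(repeated):
        return repeated.fields[0]  # unwrap the `list` wrapper
    if legacy_says_repeated_is_element(repeated, expected_fields):
        return repeated  # 2-level layout: the repeated node is the element
    return repeated.fields[0]


# The schema from this issue: list -> element (group) -> element (int64).
inner = Group("element", "optional", [Primitive("element", "optional", "int64")])
repeated = Group("list", "repeated", [inner])
expected = ["element"]  # reader expects a record with one field named "element"

wrong = element_of(repeated, expected, standard_first=False)
right = element_of(repeated, expected, standard_first=True)
print(wrong.name)  # list: the wrapper is decoded as a record, one level too deep
print(right.name)  # element: the wrapper is unwrapped first
```

In this toy model the legacy check fires only because the wrapper's single field happens to share its name with the record's single field. A name-based check for the standard layout is itself a heuristic, so a file that really uses a 2-level layout with the same names would remain ambiguous.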
Issue Links
- is caused by
  PARQUET-1681 Avro's isElementType() change breaks the reading of some parquet(1.8.1) files (Open)
- is related to
  SPARK-16344 Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+ (Resolved)
- links to