- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.7.0, 1.8.2
- Fix Version/s: None
- Component/s: parquet-avro
- Labels: None
Hello,
While reading back a Parquet file produced with Spark, it appears the schema produced by Parquet-Avro is not valid.
Consider the following simple piece of code:
ParquetReader<GenericRecord> reader = AvroParquetReader
    .<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri()))
    .build();
System.out.println(reader.read().getSchema());
I get a stack trace like:
Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
    at org.apache.avro.Schema$Names.put(Schema.java:1128)
    at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
    at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
    at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
    at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
    at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
    at org.apache.avro.Schema.toString(Schema.java:324)
    at org.apache.avro.Schema.toString(Schema.java:314)
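The root cause looks like a plain Avro naming rule rather than anything Parquet-specific: two distinct named types cannot share the same full name within one schema. The standalone sketch below (class and field names invented, optional/union types simplified, no Parquet involved) hits the same org.apache.avro.Schema$Names.put check:

import org.apache.avro.Schema;

public class CantRedefineValue {
    public static void main(String[] args) {
        // Two MAP columns whose value types are *different* records that both
        // use the short name "value" and no namespace, so their full names clash.
        String schemaJson =
              "{\"type\":\"record\",\"name\":\"spark_schema\",\"fields\":["
            + " {\"name\":\"map1\",\"type\":{\"type\":\"map\",\"values\":"
            + "   {\"type\":\"record\",\"name\":\"value\",\"fields\":[{\"name\":\"a\",\"type\":\"long\"}]}}},"
            + " {\"name\":\"map2\",\"type\":{\"type\":\"map\",\"values\":"
            + "   {\"type\":\"record\",\"name\":\"value\",\"fields\":[{\"name\":\"b\",\"type\":\"string\"}]}}}"
            + "]}";

        // Throws org.apache.avro.SchemaParseException: Can't redefine: value
        new Schema.Parser().parse(schemaJson);
    }
}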
The issue seems the same as the one reported in:
It has been fixed in spark-avro by:
https://github.com/databricks/spark-avro/pull/73
In our case, the parquet schema looks like:
message spark_schema {
  optional group calculatedobjectinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional int64 calcobjid;
        optional int64 calcobjparentid;
        optional binary portfolioname (UTF8);
        optional binary portfolioscheme (UTF8);
        optional binary calcobjtype (UTF8);
        optional binary calcobjmnemonic (UTF8);
        optional binary calcobinstrumentype (UTF8);
        optional int64 calcobjectqty;
        optional binary calcobjboid (UTF8);
        optional binary analyticalfoldermnemonic (UTF8);
        optional binary calculatedidentifier (UTF8);
        optional binary calcobjlevel (UTF8);
        optional binary calcobjboidscheme (UTF8);
      }
    }
  }
  optional group riskfactorinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional binary riskfactorname (UTF8);
        optional binary riskfactortype (UTF8);
        optional binary riskfactorrole (UTF8);
      }
    }
  }
}
We indeed have two MAP fields whose value group is named 'value'. The name 'value' is the default in org.apache.spark.sql.types.MapType.
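As far as I understand, the clash goes away once each nested 'value' record gets a distinct full name, e.g. by deriving the namespace from the enclosing field path, which seems to be the spirit of the spark-avro fix linked above. A quick sketch against the plain Avro API (the namespaces and class name below are invented for illustration; this is not the actual parquet-avro conversion code):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class DistinctMapValueNames {
    public static void main(String[] args) {
        // Each map's value record keeps the short name "value" but gets its own
        // namespace, so the full names no longer collide.
        Schema calcValue = SchemaBuilder
            .record("value").namespace("spark_schema.calculatedobjectinfomap")
            .fields().optionalLong("calcobjid").endRecord();
        Schema riskValue = SchemaBuilder
            .record("value").namespace("spark_schema.riskfactorinfomap")
            .fields().optionalString("riskfactorname").endRecord();

        Schema root = SchemaBuilder.record("spark_schema").fields()
            .name("calculatedobjectinfomap").type().map().values(calcValue).noDefault()
            .name("riskfactorinfomap").type().map().values(riskValue).noDefault()
            .endRecord();

        // Prints a valid schema instead of failing with "Can't redefine: value"
        System.out.println(root.toString(true));
    }
}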
The fix does not seem trivial given the current parquet-avro code, so I doubt I will be able to craft a valid PR without directions.
Thanks,