Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.7.0, 1.8.2
Fix Version/s: None
Component/s: None
Description
Hello,
While reading back a Parquet file produced with Spark, it appears the Avro schema produced by parquet-avro is not valid.
Consider the following simple piece of code:
ParquetReader<GenericRecord> reader = AvroParquetReader
    .<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri()))
    .build();
System.out.println(reader.read().getSchema());
I get a stack trace like:
Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
    at org.apache.avro.Schema$Names.put(Schema.java:1128)
    at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
    at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
    at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
    at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
    at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
    at org.apache.avro.Schema.toString(Schema.java:324)
    at org.apache.avro.Schema.toString(Schema.java:314)
The issue seems the same as the one reported in:
It has been fixed in spark-avro in:
https://github.com/databricks/spark-avro/pull/73
In our case, the parquet schema looks like:
message spark_schema {
  optional group calculatedobjectinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional int64 calcobjid;
        optional int64 calcobjparentid;
        optional binary portfolioname (UTF8);
        optional binary portfolioscheme (UTF8);
        optional binary calcobjtype (UTF8);
        optional binary calcobjmnemonic (UTF8);
        optional binary calcobinstrumentype (UTF8);
        optional int64 calcobjectqty;
        optional binary calcobjboid (UTF8);
        optional binary analyticalfoldermnemonic (UTF8);
        optional binary calculatedidentifier (UTF8);
        optional binary calcobjlevel (UTF8);
        optional binary calcobjboidscheme (UTF8);
      }
    }
  }
  optional group riskfactorinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional binary riskfactorname (UTF8);
        optional binary riskfactortype (UTF8);
        optional binary riskfactorrole (UTF8);
      }
    }
  }
}
We indeed have two map fields whose value group is named 'value'; that name is the default used by org.apache.spark.sql.types.MapType.
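The failure can be reproduced without Parquet at all: Avro record names live in a single namespace, so a schema containing two structurally different records that both carry the name "value" (as parquet-avro generates for the two map columns above) cannot be serialized. This is a minimal sketch using Avro's SchemaBuilder; the field names are borrowed from the schema above, and the class name RedefineDemo is illustrative:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class RedefineDemo {

    static Schema buildSchema() {
        // Two distinct record types that both carry the name "value",
        // mirroring what parquet-avro produces for Spark's map columns.
        Schema valueA = SchemaBuilder.record("value").fields()
            .optionalLong("calcobjid")
            .endRecord();
        Schema valueB = SchemaBuilder.record("value").fields()
            .optionalString("riskfactorname")
            .endRecord();
        // A top-level record with two map fields whose value records
        // collide on the name "value".
        return SchemaBuilder.record("spark_schema").fields()
            .name("calculatedobjectinfomap").type().map().values(valueA).noDefault()
            .name("riskfactorinfomap").type().map().values(valueB).noDefault()
            .endRecord();
    }

    public static void main(String[] args) {
        try {
            // Serializing the schema walks both records; the second
            // attempt to define "value" throws SchemaParseException.
            System.out.println(buildSchema().toString());
        } catch (org.apache.avro.SchemaParseException e) {
            System.out.println("Failed as expected: " + e.getMessage());
        }
    }
}
```

Building the schema objects succeeds; the clash only surfaces when the schema is written out, which is why the reader above fails on getSchema().toString() rather than on read.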
Given the current parquet-avro code, the fix does not seem trivial, so I doubt I will be able to craft a valid PR without directions.
Thanks,