[PARQUET-1202] Add differentiation of nested records with the same name - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.7.0, 1.8.2
Fix Version/s: None
Component/s: parquet-avro
Labels: None

Description

Hello,

When reading back a Parquet file produced with Spark, the Avro schema produced by parquet-avro appears to be invalid.

Consider the following simple piece of code:


ParquetReader<GenericRecord> reader =
    AvroParquetReader.<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri())).build();
System.out.println(reader.read().getSchema());

I get a stack trace like:


Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
       at org.apache.avro.Schema$Names.put(Schema.java:1128)
       at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
       at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
       at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
       at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
       at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
       at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
       at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
       at org.apache.avro.Schema.toString(Schema.java:324)
       at org.apache.avro.Schema.toString(Schema.java:314)

The issue seems the same as the one reported in:

https://www.bountysource.com/issues/22823013-spark-avro-fails-to-save-df-with-nested-records-having-the-same-name

It has been fixed in spark-avro in:

https://github.com/databricks/spark-avro/pull/73

In our case, the parquet schema looks like:


message spark_schema {
  optional group calculatedobjectinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional int64 calcobjid;
        optional int64 calcobjparentid;
        optional binary portfolioname (UTF8);
        optional binary portfolioscheme (UTF8);
        optional binary calcobjtype (UTF8);
        optional binary calcobjmnemonic (UTF8);
        optional binary calcobinstrumentype (UTF8);
        optional int64 calcobjectqty;
        optional binary calcobjboid (UTF8);
        optional binary analyticalfoldermnemonic (UTF8);
        optional binary calculatedidentifier (UTF8);
        optional binary calcobjlevel (UTF8);
        optional binary calcobjboidscheme (UTF8);
      }
    }
  }
  optional group riskfactorinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional binary riskfactorname (UTF8);
        optional binary riskfactortype (UTF8);
        optional binary riskfactorrole (UTF8);
      }
    }
  }
}

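We do indeed have two map fields whose value groups are both named 'value', so both are converted to Avro records with the same name but different fields. The name 'value' is the default used by org.apache.spark.sql.types.MapType.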
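To illustrate the collision, here is a minimal, self-contained sketch. It does not use the Avro or parquet-avro APIs: the Registry class is a toy model of the "one definition per full name" rule behind the "Can't redefine" error, and qualify() is a hypothetical naming scheme (prefixing the record name with its parent field path), shown only as one possible direction and not as the fix adopted by any library. The field lists are shortened from the schema above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a schema name registry; NOT the Avro or parquet-avro API. */
final class NestedRecordNames {

    /** Mimics the rule behind "Can't redefine": one definition per full name. */
    static final class Registry {
        private final Map<String, List<String>> seen = new HashMap<>();

        void define(String fullName, List<String> fields) {
            List<String> previous = seen.putIfAbsent(fullName, fields);
            // Re-registering an identical definition is fine; a different one is not.
            if (previous != null && !previous.equals(fields)) {
                throw new IllegalStateException("Can't redefine: " + fullName);
            }
        }
    }

    /** Hypothetical scheme: qualify a record name with the path of its parent field. */
    static String qualify(String parentPath, String name) {
        return parentPath.isEmpty() ? name : parentPath + "." + name;
    }

    /** Returns true if both map value records can be registered side by side. */
    static boolean canRegister(boolean qualifyNames) {
        Registry registry = new Registry();
        String first = qualifyNames ? qualify("calculatedobjectinfomap", "value") : "value";
        String second = qualifyNames ? qualify("riskfactorinfomap", "value") : "value";
        try {
            registry.define(first, List.of("calcobjid", "calcobjparentid"));
            registry.define(second, List.of("riskfactorname", "riskfactortype"));
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("plain names:     " + canRegister(false));
        System.out.println("qualified names: " + canRegister(true));
    }
}
```

With plain names the second definition of 'value' is rejected; with path-qualified names the two records no longer share a full name, so both can be registered.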

The fix does not seem trivial given the current parquet-avro code, so I doubt I will be able to craft a valid PR without some direction.

Thanks,

People

Assignee: Unassigned

Reporter: Benoit Lacelle

Votes: 0

Watchers: 2

Dates

Created: 29/Jan/18 09:12

Updated: 23/Jun/24 03:30