Parquet / PARQUET-1202

Add differentiation of nested records with the same name

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.0, 1.8.2
    • Fix Version/s: None
    • Component/s: parquet-avro
    • Labels: None

    Description

      Hello,

      When reading back a Parquet file produced with Spark, the schema produced by parquet-avro appears to be invalid.

      Consider the following simple piece of code:

      
      ParquetReader<GenericRecord> reader =
          AvroParquetReader.<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri())).build();
      // Printing the schema calls Schema.toString(), which is where the exception below is thrown
      System.out.println(reader.read().getSchema());
      

      I get a stack trace like:

      
      Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
             at org.apache.avro.Schema$Names.put(Schema.java:1128)
             at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
             at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
             at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
             at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
             at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
             at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
             at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
             at org.apache.avro.Schema.toString(Schema.java:324)
             at org.apache.avro.Schema.toString(Schema.java:314)
      
      

       

      The issue seems to be the same as the one reported in:

      https://www.bountysource.com/issues/22823013-spark-avro-fails-to-save-df-with-nested-records-having-the-same-name

       

      It has been fixed in spark-avro in:

      https://github.com/databricks/spark-avro/pull/73

      In our case, the parquet schema looks like:

      
      message spark_schema {
        optional group calculatedobjectinfomap (MAP) {
          repeated group key_value {
            required binary key (UTF8);
            optional group value {
              optional int64 calcobjid;
              optional int64 calcobjparentid;
              optional binary portfolioname (UTF8);
              optional binary portfolioscheme (UTF8);
              optional binary calcobjtype (UTF8);
              optional binary calcobjmnemonic (UTF8);
              optional binary calcobinstrumentype (UTF8);
              optional int64 calcobjectqty;
              optional binary calcobjboid (UTF8);
              optional binary analyticalfoldermnemonic (UTF8);
              optional binary calculatedidentifier (UTF8);
              optional binary calcobjlevel (UTF8);
              optional binary calcobjboidscheme (UTF8);
            }
          }
        }
        optional group riskfactorinfomap (MAP) {
          repeated group key_value {
            required binary key (UTF8);
            optional group value {
              optional binary riskfactorname (UTF8);
              optional binary riskfactortype (UTF8);
              optional binary riskfactorrole (UTF8);
            }
          }
        }
      }
      
      

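      We indeed have two map fields whose value groups are both named 'value' but contain different fields. The name 'value' is the default in org.apache.spark.sql.types.MapType. parquet-avro seems to turn each of these groups into an Avro record named 'value', and Avro refuses to define two different records under the same name.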
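
      To illustrate the collision and one possible way to differentiate the records, here is a minimal sketch in plain Java. It does not use Avro or parquet-avro: the NestedRecordNames class, the Group type and the register method are hypothetical names introduced only for this illustration, and qualifying a record name with its parent path is an assumption about how a fix could look, not the actual parquet-avro behaviour. The sample schema lists only the first few fields of each 'value' group from the schema above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model only: none of these types come from Avro or parquet-avro.
class NestedRecordNames {

    /** Stand-in for a Parquet group: its name, its primitive field names, its nested groups. */
    static final class Group {
        final String name;
        final List<String> leaves;
        final List<Group> children;

        Group(String name, List<String> leaves, Group... children) {
            this.name = name;
            this.leaves = leaves;
            this.children = Arrays.asList(children);
        }
    }

    /** Registers every group as a named record; a name may be reused only for an identical definition. */
    static Map<String, List<String>> register(Group root, boolean qualifyWithPath) {
        Map<String, List<String>> names = new LinkedHashMap<>();
        visit(root, "", qualifyWithPath, names);
        return names;
    }

    private static void visit(Group g, String parentPath, boolean qualify,
                              Map<String, List<String>> names) {
        String path = parentPath.isEmpty() ? g.name : parentPath + "." + g.name;
        // Without qualification the record name is just the group name, as in the failing case.
        String fullName = qualify ? path : g.name;
        List<String> previous = names.putIfAbsent(fullName, g.leaves);
        if (previous != null && !previous.equals(g.leaves)) {
            throw new IllegalStateException("Can't redefine: " + fullName);
        }
        for (Group child : g.children) {
            visit(child, path, qualify, names);
        }
    }

    /** The two maps from the schema above, with only the first few value fields listed. */
    static Group sampleSchema() {
        List<String> none = Collections.emptyList();
        Group calcValue = new Group("value",
                Arrays.asList("calcobjid", "calcobjparentid", "portfolioname"));
        Group riskValue = new Group("value",
                Arrays.asList("riskfactorname", "riskfactortype", "riskfactorrole"));
        return new Group("spark_schema", none,
                new Group("calculatedobjectinfomap", none,
                        new Group("key_value", Arrays.asList("key"), calcValue)),
                new Group("riskfactorinfomap", none,
                        new Group("key_value", Arrays.asList("key"), riskValue)));
    }

    public static void main(String[] args) {
        Group schema = sampleSchema();
        try {
            register(schema, false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Can't redefine: value"
        }
        // With path-qualified names every group gets its own entry.
        System.out.println(register(schema, true).size()); // prints 7
    }
}
```

      In this model the two 'key_value' groups do not collide because their definitions are identical, while the two 'value' groups do. Once names are qualified, they become 'spark_schema.calculatedobjectinfomap.key_value.value' and 'spark_schema.riskfactorinfomap.key_value.value', which no longer clash.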

      The fix does not seem trivial given the current parquet-avro code, so I doubt I will be able to craft a valid PR without guidance.

      Thanks,

          People

            Assignee: Unassigned
            Reporter: Benoit Lacelle (blasd)