  1. Parquet
  2. PARQUET-1202

Add differentiation of nested records with the same name

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.0, 1.8.2
    • Fix Version/s: None
    • Component/s: parquet-avro
    • Labels:
      None

      Description

      Hello,

      While reading back a Parquet file produced with Spark, it appears the schema produced by Parquet-Avro is not valid.

      Consider the following simple piece of code:

      
      ParquetReader<GenericRecord> reader =
          AvroParquetReader.<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri())).build();
      // Printing the schema calls Schema.toString(), which is where the exception below is raised.
      System.out.println(reader.read().getSchema());
      
      

      I get a stack trace like:

      
      Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
             at org.apache.avro.Schema$Names.put(Schema.java:1128)
             at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
             at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
             at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
             at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
             at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
             at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
             at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
             at org.apache.avro.Schema.toString(Schema.java:324)
             at org.apache.avro.Schema.toString(Schema.java:314)
      
      

       

      The issue seems the same as the one reported in:

      https://www.bountysource.com/issues/22823013-spark-avro-fails-to-save-df-with-nested-records-having-the-same-name

       

      It has been fixed in spark-avro in:

      https://github.com/databricks/spark-avro/pull/73

      In our case, the parquet schema looks like:

      
      message spark_schema {
        optional group calculatedobjectinfomap (MAP) {
          repeated group key_value {
            required binary key (UTF8);
            optional group value {
              optional int64 calcobjid;
              optional int64 calcobjparentid;
              optional binary portfolioname (UTF8);
              optional binary portfolioscheme (UTF8);
              optional binary calcobjtype (UTF8);
              optional binary calcobjmnemonic (UTF8);
              optional binary calcobinstrumentype (UTF8);
              optional int64 calcobjectqty;
              optional binary calcobjboid (UTF8);
              optional binary analyticalfoldermnemonic (UTF8);
              optional binary calculatedidentifier (UTF8);
              optional binary calcobjlevel (UTF8);
              optional binary calcobjboidscheme (UTF8);
            }
          }
        }
        optional group riskfactorinfomap (MAP) {
          repeated group key_value {
            required binary key (UTF8);
            optional group value {
              optional binary riskfactorname (UTF8);
              optional binary riskfactortype (UTF8);
              optional binary riskfactorrole (UTF8);
            }
          }
        }
      }
      
      

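      We indeed have two map fields whose value groups are both named 'value'. The name 'value' is the default used by org.apache.spark.sql.types.MapType. parquet-avro appears to turn each of these groups into an Avro record named 'value', and since the two records have different fields, Avro refuses to serialize the schema ("Can't redefine: value").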
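
      To illustrate the collision, here is a minimal sketch in plain Java. It is a simplified stand-in for a named-type registry, not Avro's actual implementation: the class NameCollisionSketch and its define method are hypothetical names, and the field lists are abbreviated from the schema above. It assumes that a record name may be reused only when the definition is identical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical stand-in for a named-type registry; NOT Avro's real code.
final class NameCollisionSketch {

    // Maps each record name to the field list it was first defined with.
    private final Map<String, List<String>> known = new LinkedHashMap<>();

    // Registers a record. Re-registering an identical definition is accepted;
    // reusing the name for a different definition fails.
    void define(String recordName, List<String> fields) {
        List<String> previous = known.putIfAbsent(recordName, fields);
        if (previous != null && !previous.equals(fields)) {
            throw new IllegalStateException("Can't redefine: " + recordName);
        }
    }

    public static void main(String[] args) {
        NameCollisionSketch names = new NameCollisionSketch();
        // Value group of calculatedobjectinfomap (fields abbreviated).
        names.define("value", Arrays.asList("calcobjid", "calcobjparentid", "portfolioname"));
        try {
            // Value group of riskfactorinfomap: same name, different fields.
            names.define("value", Arrays.asList("riskfactorname", "riskfactortype", "riskfactorrole"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

      With a single map field, or with two maps whose value groups have identical fields, this sketch raises no error, which may explain why the problem only shows up with schemas like the one above.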

      The fix does not seem trivial given the current parquet-avro code, so I doubt I will be able to craft a valid PR without some direction.
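
      One possible direction, sketched here under assumptions rather than taken from the parquet-avro code base, is to differentiate nested records by qualifying each record name with the path of its enclosing fields. The class QualifiedNameSketch and the qualify helper are hypothetical, and whether the synthetic key_value level should be part of the path is left open.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: derive a distinct full name for each nested record
// from the names of its enclosing fields. Not part of parquet-avro.
final class QualifiedNameSketch {

    // Joins the enclosing field names with dots and appends the record name.
    // A record with no enclosing fields keeps its plain name.
    static String qualify(List<String> parentPath, String recordName) {
        if (parentPath.isEmpty()) {
            return recordName;
        }
        return String.join(".", parentPath) + "." + recordName;
    }

    public static void main(String[] args) {
        String first = qualify(Arrays.asList("spark_schema", "calculatedobjectinfomap"), "value");
        String second = qualify(Arrays.asList("spark_schema", "riskfactorinfomap"), "value");
        System.out.println(first);  // spark_schema.calculatedobjectinfomap.value
        System.out.println(second); // spark_schema.riskfactorinfomap.value
    }
}
```

      With names built this way, the two 'value' records would no longer share a full name, at the cost of changing the record names seen by existing readers.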

      Thanks,

            People

            • Assignee:
              Unassigned
            • Reporter:
              blasd Benoit Lacelle