Parquet / PARQUET-1409

Can write but not read Parquet file with nested arrays


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Component/s: parquet-avro
    • Labels:
      None

      Description

      I am trying to read a Parquet file from Scala using the Avro interface (version 1.10.0, as listed above). The file was also generated using the same interface.

      The data that I am writing looks like this:

       

      case class Inner(b: Array[Int])
      case class Outer(a: Array[Inner])

      val data = Outer(
        Array(
          Inner(Array(1, 2)),
          Inner(Array(3, 4))
        )
      )
      

       

      Using parquet-tools to read the file produces this output:

       

      $ parquet-tools cat /tmp/test.parquet 
      a: 
      .array: 
      ..b: 
      ...array = 1 
      ...array = 2 
      .array: 
      ..b: 
      ...array = 3 
      ...array = 4
      

       

      But when trying to read the file back I get the following exception:

      Exception in thread "main" org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'array' not found
      at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
      at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
      at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
      at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
      at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
      at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
      at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
      at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
      at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
      at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
      at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
      at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
      at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
      at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
      at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
      at raw.runtime.writer.parquet.avro.Lixo$.main(Lixo.scala:78)
      at raw.runtime.writer.parquet.avro.Lixo.main(Lixo.scala)
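
      The reader code that raises this exception is not shown in the report; a minimal sketch of what such a reader would typically look like with AvroParquetReader (the file path and Configuration are reused from the writer below; this is an assumption, not the reporter's exact code) is:

      ```scala
      import org.apache.avro.generic.GenericRecord
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.parquet.avro.AvroParquetReader
      import org.apache.parquet.hadoop.ParquetReader

      // Hypothetical reader: the stack trace above is raised from read(),
      // while AvroRecordConverter matches the Avro schema inferred from
      // the file against the Parquet list structure.
      val reader: ParquetReader[GenericRecord] =
        AvroParquetReader.builder[GenericRecord](new Path("/tmp/test.parquet"))
          .withConf(new Configuration())
          .build()

      var record = reader.read()
      while (record != null) {
        println(record)
        record = reader.read()
      }
      reader.close()
      ```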
      

      This is the code used to generate this file:

      import java.io.FileOutputStream
      import java.nio.file.Paths

      import scala.collection.JavaConverters._

      import org.apache.avro.Schema
      import org.apache.avro.generic.{GenericData, GenericRecord}
      import org.apache.hadoop.conf.Configuration
      import org.apache.parquet.avro.AvroParquetWriter
      import org.apache.parquet.hadoop.ParquetWriter

      val filename = "/tmp/test.parquet"
      val path = Paths.get(filename).toFile
      val conf = new Configuration()

      // "inner" is created in the "outer" namespace
      val schema: Schema = {
        val inner = Schema.createRecord("inner", "some doc", "outer", false,
          List(new Schema.Field("b", Schema.createArray(Schema.create(Schema.Type.INT)), "", null: Object)).asJava
        )

        Schema.createRecord("outer", "", "", false,
          List(new Schema.Field("a", Schema.createArray(inner), "", null: Object)).asJava
        )
      }

      val os = new FileOutputStream(path)

      // RawParquetOutputFile is a custom OutputFile wrapper around the stream
      val outputFile = new RawParquetOutputFile(os)
      val parquetWriter: ParquetWriter[GenericRecord] = AvroParquetWriter.builder[GenericRecord](outputFile)
        .withConf(conf)
        .withSchema(schema)
        .build()

      val data = Outer(
        Array(
          Inner(Array(1, 2)),
          Inner(Array(3, 4))
        )
      )

      // Convert the case-class data into Avro generic records
      val record = new GenericData.Record(schema)
      val fieldA = schema.getField("a").schema()
      val recordData = {
        val fieldAType = fieldA.getElementType()
        data.a.map { x =>
          val innerRecord = new GenericData.Record(fieldAType)
          innerRecord.put("b", x.b)
          innerRecord
        }
      }

      record.put("a", recordData)
      parquetWriter.write(record)
      parquetWriter.close()
      os.close()
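
      For reference, the programmatic schema above corresponds to roughly the following Avro JSON (reconstructed by hand from the createRecord calls; the empty doc strings are omitted here):

      ```json
      {
        "type": "record",
        "name": "outer",
        "fields": [
          {
            "name": "a",
            "type": {
              "type": "array",
              "items": {
                "type": "record",
                "name": "inner",
                "namespace": "outer",
                "doc": "some doc",
                "fields": [
                  {
                    "name": "b",
                    "type": { "type": "array", "items": "int" }
                  }
                ]
              }
            }
          }
        ]
      }
      ```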
      

       Also, if I pass the configuration option

      parquet.avro.add-list-element-records = false


      I get a different exception:

      org.apache.avro.SchemaParseException: Can't redefine: list
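
      For reference, a sketch of how the option mentioned above would be wired in (parquet-avro exposes this key as the constant AvroSchemaConverter.ADD_LIST_ELEMENT_RECORDS, though setting the string directly works as well; the resulting Configuration is then passed to the reader via withConf()):

      ```scala
      import org.apache.hadoop.conf.Configuration

      // Hypothetical wiring of the option; this Configuration would be
      // passed to AvroParquetReader.builder(...).withConf(conf)
      val conf = new Configuration()
      conf.setBoolean("parquet.avro.add-list-element-records", false)
      ```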

       

      Am I doing something wrong?


        People

        • Assignee: Unassigned
        • Reporter: Torcato cesar matos
        • Votes: 0
        • Watchers: 1