Apache Hudi / HUDI-864

parquet schema conflict: optional binary <some-field> (UTF8) is not a group

Details

    Description

      When dealing with struct types like the following (Spark StructType JSON):

      {
        "type": "struct",
        "fields": [
          {
            "name": "categoryResults",
            "type": {
              "type": "array",
              "elementType": {
                "type": "struct",
                "fields": [
                  {
                    "name": "categoryId",
                    "type": "string",
                    "nullable": true,
                    "metadata": {}
                  }
                ]
              },
              "containsNull": true
            },
            "nullable": true,
            "metadata": {}
          }
        ]
      }
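
      A minimal sketch that should reproduce this with the Hudi Spark datasource (simplified, not the actual job; table name, base path, and key/partition/precombine fields are placeholders):

      import org.apache.spark.sql.{SaveMode, SparkSession}

      // Nested payload matching the failing schema: an array of single-field structs.
      case class CategoryResult(categoryId: String)
      case class Event(id: String, ts: Long, categoryResults: Seq[CategoryResult])

      val spark = SparkSession.builder().appName("hudi-864-repro").getOrCreate()
      import spark.implicits._

      val batch = Seq(Event("1", 1L, Seq(CategoryResult("cat-a")))).toDF()

      // Ingest the same records twice: the second batch is an upsert and has to
      // merge against the parquet file produced by the first commit.
      def ingest(mode: SaveMode): Unit = {
        batch.write.format("org.apache.hudi")
          .option("hoodie.table.name", "hudi_864_repro")
          .option("hoodie.datasource.write.recordkey.field", "id")
          .option("hoodie.datasource.write.partitionpath.field", "id") // placeholder partitioning
          .option("hoodie.datasource.write.precombine.field", "ts")
          .option("hoodie.datasource.write.operation", "upsert")
          .mode(mode)
          .save("/tmp/hudi_864_repro")
      }

      ingest(SaveMode.Overwrite) // first commit succeeds
      ingest(SaveMode.Append)    // second commit fails while reading back the first file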
      

      The second ingest batch throws this exception:

      ERROR [Executor task launch worker for task 15] commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error upserting bucketType UPDATE for partition :0
      org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
      	at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
      	at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
      	at org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
      	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
      	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
      	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
      	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
      	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
      	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
      	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
      	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
      	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
      	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
      	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
      	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      	at org.apache.spark.scheduler.Task.run(Task.scala:123)
      	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
      	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
      	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
      	at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:98)
      	... 34 more
      Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
      	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
      	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
      	... 35 more
      Caused by: org.apache.hudi.exception.HoodieException: operation has failed
      	at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.throwExceptionIfFailed(BoundedInMemoryQueue.java:227)
      	at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.readNextRecord(BoundedInMemoryQueue.java:206)
      	at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.access$100(BoundedInMemoryQueue.java:52)
      	at org.apache.hudi.common.util.queue.BoundedInMemoryQueue$QueueIterator.hasNext(BoundedInMemoryQueue.java:257)
      	at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:36)
      	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	... 3 more
      Caused by: java.lang.ClassCastException: optional binary categoryId (UTF8) is not a group
      	at org.apache.parquet.schema.Type.asGroupType(Type.java:207)
      	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
      	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
      	at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
      	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
      	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
      	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
      	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
      	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
      	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
      	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
      	at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
      	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
      	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
      	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
      	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
      	at org.apache.hudi.client.utils.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
      	at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
      	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	... 4 more
      

      Parquet schema of the failing struct:

      optional group categoryResults (LIST) {
        repeated group array {
          optional binary categoryId (UTF8);
        }
      }
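
      To see what parquet-avro itself makes of this layout, the file's parquet schema and the Avro schema that AvroSchemaConverter derives from it by default can be dumped with a small sketch like the one below (the path is a placeholder; this shows only the default conversion, independent of any Avro read schema Hudi may pass in on the upsert path):

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.parquet.avro.AvroSchemaConverter
      import org.apache.parquet.hadoop.ParquetFileReader
      import org.apache.parquet.hadoop.util.HadoopInputFile

      // Print the parquet footer schema and the Avro schema parquet-avro derives from it.
      def dumpSchemas(file: String): Unit = {
        val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(file), new Configuration()))
        try {
          val parquetSchema = reader.getFooter.getFileMetaData.getSchema
          println(parquetSchema)
          println(new AvroSchemaConverter().convert(parquetSchema).toString(true))
        } finally reader.close()
      }

      dumpSchemas("/tmp/hudi_864_repro/<partition>/<failing-file>.parquet")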
      

      When the leaf record has multiple fields, the issue goes away, so I assume this relates to parquet or avro (the parquet-avro conversion). The following array-of-struct definition is handled fine, without any exception:

          optional group productResult (LIST) {
            repeated group array {
              optional binary productId (UTF8);
              optional boolean productImages;
              optional binary productShortDescription (UTF8);
            }
          }
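
      The single-field vs. multi-field difference is consistent with parquet-avro's element-type heuristic for the legacy two-level list layout. A rough Scala paraphrase, based on my reading of AvroRecordConverter.isElementType (not verbatim library code, and possibly differing from the exact parquet version bundled here):

      import org.apache.avro.Schema
      import org.apache.parquet.schema.Type

      // Decide whether the legacy "repeated group array { ... }" is itself the list
      // element, or only a wrapper whose single inner field is the element.
      def repeatedGroupIsElement(repeated: Type, avroElement: Schema): Boolean = {
        if (repeated.isPrimitive || repeated.asGroupType().getFieldCount > 1) {
          // More than one field: the repeated group can only be the element itself,
          // which would explain why the productResult layout above reads fine.
          true
        } else {
          // Exactly one field: ambiguous. The repeated group is taken as the element
          // only if the Avro read schema's element is a single-field record whose
          // field name matches; otherwise the single inner field is taken as the
          // element, and a record-typed Avro element then fails with the
          // ClassCastException shown above.
          avroElement != null &&
            avroElement.getType == Schema.Type.RECORD &&
            avroElement.getFields.size == 1 &&
            avroElement.getFields.get(0).name == repeated.asGroupType().getFieldName(0)
        }
      }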
      

            People

              Assignee: alexey.kudinkin (Alexey Kudinkin)
              Reporter: rolandjohann (Roland Johann)
