Beam / BEAM-11002

XmlIO buffer overflow exception

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P1
    • Resolution: Fixed
    • Affects Version/s: 2.23.0, 2.24.0
    • Fix Version/s: Missing
    • Component/s: io-java-xml

    Description

      We're making use of Apache Beam on Google Dataflow.
      We're using XmlIO to read an XML file with the following setup:

      pipeline
          .apply("Read Storage Bucket",
              XmlIO.read<XmlProduct>()
                  .from(sourcePath)
                  .withRootElement(xmlProductRoot)
                  .withRecordElement(xmlProductRecord)
                  .withRecordClass(XmlProduct::class.java)
          )
      

      However, from time to time we get a buffer overflow exception when reading seemingly random XML files:

      "Error message from worker: java.io.IOException: Failed to start reading from source: gs://path-to-xml-file.xml range [1722550, 2684411)
      	org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610)
      	org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359)
      	org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
      	org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
      	org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
      	org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
      	org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
      	org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
      	org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
      	org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
      	org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
      	java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.nio.BufferOverflowException
      	java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662)
      	java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196)
      	org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285)
      	org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192)
      	org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
      	org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249)
      	org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607)
      	... 14 more
      

      We can't reproduce this buffer overflow exception locally with the DirectRunner. If we rerun the Dataflow job in Google Cloud, it completes without any exceptions.
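
      For reference, the root cause in the trace, java.nio.BufferOverflowException, is what the JDK throws when a relative put() is attempted on a ByteBuffer that is already at its limit (the check in Buffer.nextPutIndex seen in the stack). The minimal standalone sketch below only demonstrates that JDK behavior; it does not reproduce Beam's XmlSource logic or its actual buffer sizing:

      import java.nio.BufferOverflowException;
      import java.nio.ByteBuffer;

      public class BufferOverflowDemo {
          public static void main(String[] args) {
              // Allocate a small fixed-capacity buffer.
              ByteBuffer buf = ByteBuffer.allocate(4);
              // Fill it exactly to capacity; position now equals limit.
              for (int i = 0; i < buf.capacity(); i++) {
                  buf.put((byte) i);
              }
              try {
                  // One more relative put past the limit throws,
                  // just as in HeapByteBuffer.put in the trace above.
                  buf.put((byte) 99);
              } catch (BufferOverflowException e) {
                  System.out.println("caught BufferOverflowException");
              }
          }
      }

      This suggests XmlSource's getFirstOccurenceOfRecordElement writes more bytes into an internal buffer than it allocated for some inputs, though we have not inspected that code.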

          People

            Assignee: Unassigned
            Reporter: Duncan Lew (dlew)
            Votes: 1
            Watchers: 3