Details
- Type: Bug
- Status: Resolved
- Priority: P1
- Resolution: Fixed
- Affects Versions: 2.23.0, 2.24.0
Description
We're making use of Apache Beam in Google Dataflow.
We're using XmlIO to read an XML file with the following setup:

    pipeline
        .apply("Read Storage Bucket", XmlIO.read<XmlProduct>()
            .from(sourcePath)
            .withRootElement(xmlProductRoot)
            .withRecordElement(xmlProductRecord)
            .withRecordClass(XmlProduct::class.java)
        )
However, from time to time we get a buffer overflow exception while reading random XML files:
    Error message from worker: java.io.IOException: Failed to start reading from source: gs://path-to-xml-file.xml range [1722550, 2684411)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
        org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
        org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
        org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
        org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
        org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
        org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
        org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
        java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        java.base/java.lang.Thread.run(Thread.java:834)
    Caused by: java.nio.BufferOverflowException
        java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662)
        java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196)
        org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285)
        org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192)
        org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
        org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607)
        ... 14 more
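The `Caused by` frames show a relative `HeapByteBuffer.put` failing inside `XmlSource$XMLReader.getFirstOccurenceOfRecordElement`, which is the generic `java.nio` failure mode when bytes are written into a fixed-capacity buffer past its limit. The sketch below is not Beam's actual code; it is a minimal, self-contained illustration (all names are invented for the example) of how an unchecked relative `put` produces exactly this exception type:

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

public class OverflowDemo {
    public static void main(String[] args) {
        // Fixed-capacity heap buffer, standing in for an internal scan buffer.
        ByteBuffer buf = ByteBuffer.allocate(4);
        // One byte more than the buffer's capacity.
        byte[] data = {'<', 'p', 'r', 'o', 'd'};
        try {
            for (byte b : data) {
                // Relative put with no remaining() check: the 5th byte throws.
                buf.put(b);
            }
        } catch (BufferOverflowException e) {
            System.out.println("BufferOverflowException after "
                    + buf.position() + " bytes");
        }
    }
}
```

Guarding each write with `buf.remaining() > 0` (or flushing and clearing the buffer when it fills) avoids the exception in this toy case.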
We can't reproduce this buffer overflow exception locally with the DirectRunner. If we rerun the Dataflow job in Google Cloud, it can complete correctly without any exceptions.