Description
We enabled the read optimization feature introduced in ORC-1138 (Seek vs Read Optimization) in a Spark program, with the following configuration:
spark.hadoop.orc.min.disk.seek.size=134217728
spark.hadoop.orc.min.disk.seek.size.tolerance=100
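For reference, a minimal sketch of how these two properties can be set programmatically on a SparkSession; the application name and input path below are placeholders for illustration, not part of the original job:

import org.apache.spark.sql.SparkSession;

public class OrcReadOptimizationExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("orc-read-optimization-example")
        // Merge adjacent disk ranges into a single read of up to 128 MB.
        .config("spark.hadoop.orc.min.disk.seek.size", "134217728")
        // Tolerance for the extra bytes read as a result of merging ranges.
        .config("spark.hadoop.orc.min.disk.seek.size.tolerance", "100")
        .getOrCreate();

    // Placeholder path for a large ORC file.
    spark.read().orc("/path/to/large.orc").count();
    spark.stop();
  }
}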
When reading an extremely large ORC file (17.3 GB), the program threw a java.nio.BufferOverflowException. After adding some additional logging, we obtained the following output and stack trace:
23/11/23 18:59:09 INFO Executor: Finished task 91.0 in stage 0.0 (TID 83). 2866 bytes result sent to driver
23/11/23 18:59:09 INFO RecordReaderUtils: readBytes = 26099343, reqBytes = 25728
23/11/23 18:59:09 INFO RecordReaderUtils: copyLength = 77960
23/11/23 18:59:09 INFO RecordReaderUtils: newBuffer.remaining() = 8470
23/11/23 18:59:09 INFO RecordReaderUtils: BufferChunk begin:
23/11/23 18:59:09 INFO RecordReaderUtils: data range [15032282586, 15032298848), size: 16262 type: array-backed
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15032298848, cf = 2147483647, ef = 15032298848
23/11/23 18:59:09 INFO RecordReaderUtils: data range [15032298848, 15032299844), size: 996 type: array-backed
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15032299844, cf = 2147483647, ef = 15032299844
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15032299844, 15032377804)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15032377804, cf = 2147483647, ef = 15032377804
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058260587, 15058261632)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058261632, cf = 15058260587, ef = 15058261632
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058261632, 15058288409)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058288409, cf = 15058260587, ef = 15058288409
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058288409, 15058288862)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058288862, cf = 15058260587, ef = 15058288862
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058339730, 15058340775)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058340775, cf = 15058339730, ef = 15058340775
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058340775, 15058342439)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058342439, cf = 15058339730, ef = 15058342439
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058449794, 15058449982)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058449982, cf = 15058449794, ef = 15058449982
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058449982, 15058451700)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058451700, cf = 15058449794, ef = 15058451700
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058451700, 15058451749)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058451749, cf = 15058449794, ef = 15058451749
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058484358, 15058484422)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058484422, cf = 15058484358, ef = 15058484422
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058484422, 15058484862)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058484862, cf = 15058484358, ef = 15058484862
23/11/23 18:59:09 INFO RecordReaderUtils: data range[15058484862, 15058484878)
23/11/23 18:59:09 INFO RecordReaderUtils: f = 2147483647, e = 15058484878, cf = 15058484358, ef = 15058484878
23/11/23 18:59:09 INFO RecordReaderUtils: BufferChunk end.
reqBytes2 = 25728
23/11/23 18:59:09 ERROR Executor: Exception in task 111.0 in stage 0.0 (TID 84)
java.nio.BufferOverflowException
    at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:189)
    at org.apache.orc.impl.RecordReaderUtils$ChunkReader.populateChunksReduceSize(RecordReaderUtils.java:725)
    at org.apache.orc.impl.RecordReaderUtils$ChunkReader.populateChunks(RecordReaderUtils.java:677)
    at org.apache.orc.impl.RecordReaderUtils$ChunkReader.readRanges(RecordReaderUtils.java:801)
    at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:539)
    at org.apache.orc.impl.RecordReaderUtils.access$100(RecordReaderUtils.java:45)
    at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:109)
    at org.apache.orc.impl.reader.StripePlanner.readData(StripePlanner.java:181)
    at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1257)
    at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1298)
    at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1341)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1388)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:205)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:561)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
23/11/23 18:59:11 INFO YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
23/11/23 18:59:11 INFO MemoryStore: MemoryStore cleared
23/11/23 18:59:11 INFO BlockManager: BlockManager stopped
23/11/23 18:59:11 INFO ShutdownHookManager: Shutdown hook called
After investigating the issue, we found that in the org.apache.orc.impl.RecordReaderUtils.ChunkReader#create method, when a BufferChunk offset exceeds Integer.MAX_VALUE, the ChunkReader is created with incorrect reqBytes and readBytes values. This leads to a BufferOverflowException in the subsequent ChunkReader#readRanges call.
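To make the failure mode concrete, here is a small self-contained sketch, an illustration only and not the actual ORC code, of how narrowing 64-bit chunk offsets to int (as suggested by the f = 2147483647 values in the log above) can yield an undersized destination buffer and trigger the same BufferOverflowException:

import java.nio.ByteBuffer;

public class NarrowingOverflowDemo {
  public static void main(String[] args) {
    // Offsets taken from the first data range in the log; both exceed Integer.MAX_VALUE.
    long chunkOffset = 15_032_282_586L;
    long chunkEnd    = 15_032_298_848L;

    // Buggy pattern (assumed for illustration): clamping each 64-bit value to int
    // before computing the length loses information once offsets pass 2^31 - 1.
    int badOffset = (int) Math.min(chunkOffset, Integer.MAX_VALUE); // 2147483647
    int badLength = (int) Math.min(chunkEnd, Integer.MAX_VALUE) - badOffset; // 0 instead of 16262

    // Correct pattern: compute the length in 64-bit arithmetic before narrowing.
    int goodLength = (int) (chunkEnd - chunkOffset); // 16262

    // The destination buffer is sized from the bad length, but the actual chunk
    // data is goodLength bytes, so put() overflows.
    ByteBuffer target = ByteBuffer.allocate(badLength);
    byte[] chunkData = new byte[goodLength];
    target.put(chunkData); // throws java.nio.BufferOverflowException
  }
}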