Description
LineRecordReader creates the reader for an uncompressed split like so:
in = new UncompressedSplitLineReader(fileIn, job, recordDelimiter, split.getLength());
The split length ends up in this long field:

private long splitLength;
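For context, the constructor keeps the value as a long; the following is a paraphrase of the Hadoop source, so exact parameter names may differ between versions:

public UncompressedSplitLineReader(FSDataInputStream in, Configuration conf,
    byte[] recordDelimiterBytes, long splitLength) throws IOException {
  super(in, conf, recordDelimiterBytes);
  // splitLength arrives as a long and is stored as a long; the problem
  // appears later, when fillBuffer() casts it down to int.
  this.splitLength = splitLength;
}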
While reading the first line, fillBuffer does this:
@Override
protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
    throws IOException {
  int maxBytesToRead = buffer.length;
  if (totalBytesRead < splitLength) {
    maxBytesToRead = Math.min(maxBytesToRead,
        (int)(splitLength - totalBytesRead));
When more than Integer.MAX_VALUE bytes of the split remain unread (i.e. a split over 2 GB), the long difference is truncated by the cast to int and becomes negative, so maxBytesToRead goes negative and the subsequent DFS read fails its bounds check:
java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:559)
    at java.nio.ByteBuffer.get(ByteBuffer.java:668)
    at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
    at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:172)
    at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:744)
    at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:800)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:860)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
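A minimal self-contained sketch of the overflow (the 3 GB split length is hypothetical; any value whose unread remainder exceeds Integer.MAX_VALUE triggers it), including one possible fix of taking the min in long arithmetic before casting:

public class CastOverflowDemo {
  public static void main(String[] args) {
    long splitLength = 3L * 1024 * 1024 * 1024; // hypothetical 3 GB split
    long totalBytesRead = 0;                    // nothing read yet
    int bufferLength = 64 * 1024;               // typical read buffer size

    // Buggy order of operations: cast the long difference to int first.
    // 3221225472 does not fit in an int and wraps to -1073741824,
    // so Math.min picks the negative value.
    int buggy = Math.min(bufferLength, (int) (splitLength - totalBytesRead));

    // Possible fix: take the min as longs, then cast; the result is
    // guaranteed to fit because it is at most bufferLength.
    int fixed = (int) Math.min((long) bufferLength, splitLength - totalBytesRead);

    System.out.println("buggy maxBytesToRead = " + buggy); // -1073741824
    System.out.println("fixed maxBytesToRead = " + fixed); // 65536
  }
}

Passing the negative maxBytesToRead down to read() is what trips the bounds check in the stack trace above.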
This has been reported in https://issues.streamsets.com/browse/SDC-2229. It also happens in Hive when very large text files are forced into a single split (e.g. via the header-skipping feature, or via set mapred.min.split.size=9999999999999999).
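A sketch of how a single oversized split can be forced from Java for reproduction purposes; the input path is a placeholder, and mapreduce.input.fileinputformat.split.minsize is the non-deprecated name of mapred.min.split.size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SingleSplitRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Force every file into one split, however large; for a text file
    // over 2 GB this makes splitLength exceed Integer.MAX_VALUE.
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 9999999999999999L);

    Job job = Job.getInstance(conf, "single-split-repro");
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/huge.txt")); // placeholder
    // Running a map over this input should hit the IndexOutOfBoundsException
    // in UncompressedSplitLineReader.fillBuffer() described above.
  }
}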