Description
LineRecordReader creates the reader for an uncompressed split like so:
in = new UncompressedSplitLineReader(fileIn, job, recordDelimiter, split.getLength());
The split length ends up in this long field:

private long splitLength;
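For context, the constructor keeps the value as a long; the following is a paraphrase of the Hadoop source, so exact parameter names may differ between versions:

public UncompressedSplitLineReader(FSDataInputStream in, Configuration conf,
    byte[] recordDelimiterBytes, long splitLength) throws IOException {
  super(in, conf, recordDelimiterBytes);
  // splitLength arrives as a long and is stored as a long; the problem
  // appears later, when fillBuffer() casts it down to int.
  this.splitLength = splitLength;
}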
While reading the first line, fillBuffer does this:
@Override
protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
    throws IOException {
  int maxBytesToRead = buffer.length;
  if (totalBytesRead < splitLength) {
    maxBytesToRead = Math.min(maxBytesToRead,
        (int)(splitLength - totalBytesRead));
When more than Integer.MAX_VALUE bytes of the split remain unread (i.e. a split over 2 GB), the long difference is truncated by the cast to int and becomes negative, so maxBytesToRead goes negative and the subsequent DFS read fails its bounds check:
java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkBounds(Buffer.java:559)
    at java.nio.ByteBuffer.get(ByteBuffer.java:668)
    at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
    at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:172)
    at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:744)
    at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:800)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:860)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
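A minimal self-contained sketch of the overflow (the 3 GB split length is hypothetical; any value whose unread remainder exceeds Integer.MAX_VALUE triggers it), including one possible fix of taking the min in long arithmetic before casting:

public class CastOverflowDemo {
  public static void main(String[] args) {
    long splitLength = 3L * 1024 * 1024 * 1024; // hypothetical 3 GB split
    long totalBytesRead = 0;                    // nothing read yet
    int bufferLength = 64 * 1024;               // typical read buffer size

    // Buggy order of operations: cast the long difference to int first.
    // 3221225472 does not fit in an int and wraps to -1073741824,
    // so Math.min picks the negative value.
    int buggy = Math.min(bufferLength, (int) (splitLength - totalBytesRead));

    // Possible fix: take the min as longs, then cast; the result is
    // guaranteed to fit because it is at most bufferLength.
    int fixed = (int) Math.min((long) bufferLength, splitLength - totalBytesRead);

    System.out.println("buggy maxBytesToRead = " + buggy); // -1073741824
    System.out.println("fixed maxBytesToRead = " + fixed); // 65536
  }
}

Passing the negative maxBytesToRead down to read() is what trips the bounds check in the stack trace above.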
This has been reported in https://issues.streamsets.com/browse/SDC-2229. It also happens in Hive when very large text files are forced into a single split (e.g. via the header-skipping feature, or via set mapred.min.split.size=9999999999999999).
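A sketch of how a single oversized split can be forced from Java for reproduction purposes; the input path is a placeholder, and mapreduce.input.fileinputformat.split.minsize is the non-deprecated name of mapred.min.split.size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SingleSplitRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Force every file into one split, however large; for a text file
    // over 2 GB this makes splitLength exceed Integer.MAX_VALUE.
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 9999999999999999L);

    Job job = Job.getInstance(conf, "single-split-repro");
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/huge.txt")); // placeholder
    // Running a map over this input should hit the IndexOutOfBoundsException
    // in UncompressedSplitLineReader.fillBuffer() described above.
  }
}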