  Hadoop Map/Reduce / MAPREDUCE-6635

Unsafe long to int conversion in UncompressedSplitLineReader and IndexOutOfBoundsException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
    • Component/s: None
    • Labels: None

      Description

      LineRecordReader creates the unsplittable reader like so:

            in = new UncompressedSplitLineReader(
                fileIn, job, recordDelimiter, split.getLength());
      

      The split length ends up in a long field:

        private long splitLength;
      

      At some point when reading the first line, fillBuffer does this:

        @Override
        protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
            throws IOException {
          int maxBytesToRead = buffer.length;
          if (totalBytesRead < splitLength) {
            maxBytesToRead = Math.min(maxBytesToRead,
                                      (int)(splitLength - totalBytesRead));
      

      For splits longer than Integer.MAX_VALUE bytes (roughly 2 GB), the narrowing cast (int)(splitLength - totalBytesRead) produces a negative number, so maxBytesToRead goes negative and the subsequent DFS read fails its bounds check with the stack trace below.
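      For illustration, a minimal standalone snippet (not Hadoop code; the values are made up) showing the narrowing cast going negative once the remaining split bytes exceed Integer.MAX_VALUE:

        // Standalone illustration: narrowing a long larger than Integer.MAX_VALUE
        // to int keeps only the low 32 bits, which can come out negative.
        public class CastOverflowDemo {
          public static void main(String[] args) {
            long splitLength = 3L * 1024 * 1024 * 1024;  // a single 3 GB split
            long totalBytesRead = 0L;                    // nothing read yet
            int maxBytesToRead = (int) (splitLength - totalBytesRead);
            System.out.println(maxBytesToRead);          // prints -1073741824
          }
        }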

      java.lang.IndexOutOfBoundsException
              at java.nio.Buffer.checkBounds(Buffer.java:559)
              at java.nio.ByteBuffer.get(ByteBuffer.java:668)
              at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
              at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:172)
              at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:744)
              at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:800)
              at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:860)
              at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
              at java.io.DataInputStream.read(DataInputStream.java:149)
              at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
              at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
              at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
              at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
              at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
              at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
      
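      A minimal sketch of one way the computation could be guarded, keeping the subtraction in long arithmetic and only narrowing to int once the value is known to fit; this is illustrative only and not necessarily the committed patch:

        // Sketch of a safe computation of maxBytesToRead (illustrative; the
        // committed fix may differ): the remaining split bytes stay in a long,
        // and the cast happens only once the value is known to fit in an int.
        static int safeMaxBytesToRead(byte[] buffer, long splitLength,
                                      long totalBytesRead) {
          int maxBytesToRead = buffer.length;
          if (totalBytesRead < splitLength) {
            long bytesLeftInSplit = splitLength - totalBytesRead;
            if (bytesLeftInSplit <= Integer.MAX_VALUE) {
              maxBytesToRead = Math.min(maxBytesToRead, (int) bytesLeftInSplit);
            }
          }
          return maxBytesToRead;
        }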

      This has also been reported at https://issues.streamsets.com/browse/SDC-2229. The same failure occurs in Hive when very large text files are forced into a single split (e.g. via the header-skipping feature, or via set mapred.min.split.size=9999999999999999); a repro sketch follows below.
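      A hedged repro sketch (the class name, input path, and driver boilerplate are hypothetical; only the split-size setting matters), forcing an uncompressed text file larger than 2 GB into one split so split.getLength() exceeds Integer.MAX_VALUE:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        public class SingleSplitRepro {
          public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setInputFormatClass(TextInputFormat.class);
            // Hypothetical input: any uncompressed text file larger than 2 GB.
            FileInputFormat.addInputPath(job, new Path("/data/huge.txt"));
            // Same effect as "set mapred.min.split.size=9999999999999999":
            // the minimum split size exceeds the file size, so the whole file
            // becomes a single split handled by UncompressedSplitLineReader.
            FileInputFormat.setMinInputSplitSize(job, 9999999999999999L);
            // ... set mapper/output classes as usual, then job.waitForCompletion(true);
          }
        }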

            People

            • Assignee: Junping Du (djp)
            • Reporter: Sergey Shelukhin (sershe)
            • Votes: 0
            • Watchers: 9
