Hadoop Map/Reduce · MAPREDUCE-6635

Unsafe long to int conversion in UncompressedSplitLineReader and IndexOutOfBoundsException


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
    • Component/s: None
    • Labels: None

    Description

      LineRecordReader creates the unsplittable reader like so:

            in = new UncompressedSplitLineReader(
                fileIn, job, recordDelimiter, split.getLength());
      

      That split length is stored in a long field:

        private long splitLength;
      

      At some point when reading the first line, fillBuffer does this:

        @Override
        protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
            throws IOException {
          int maxBytesToRead = buffer.length;
          if (totalBytesRead < splitLength) {
            maxBytesToRead = Math.min(maxBytesToRead,
                                      (int)(splitLength - totalBytesRead));
      

      The result of the (int) cast is negative for splits larger than Integer.MAX_VALUE bytes, and the subsequent DFS read then fails its bounds check:

      java.lang.IndexOutOfBoundsException
              at java.nio.Buffer.checkBounds(Buffer.java:559)
              at java.nio.ByteBuffer.get(ByteBuffer.java:668)
              at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
              at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:172)
              at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:744)
              at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:800)
              at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:860)
              at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
              at java.io.DataInputStream.read(DataInputStream.java:149)
              at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
              at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
              at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
              at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
              at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
              at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
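
      To make the overflow concrete, here is a minimal standalone sketch (assumed values, not Hadoop source) that runs the same arithmetic as fillBuffer against a split larger than Integer.MAX_VALUE bytes:

        public class SplitLengthCastDemo {
          public static void main(String[] args) {
            long splitLength = 3L * 1024 * 1024 * 1024; // a single ~3 GB split
            long totalBytesRead = 0;                    // nothing read yet
            int bufferLength = 64 * 1024;               // typical buffer size

            int maxBytesToRead = bufferLength;
            if (totalBytesRead < splitLength) {
              // The narrowing cast wraps around: 3221225472L becomes -1073741824,
              // and Math.min then picks that negative value.
              maxBytesToRead = Math.min(maxBytesToRead,
                                        (int) (splitLength - totalBytesRead));
            }
            System.out.println(maxBytesToRead); // prints -1073741824
          }
        }

      Passing that negative value as the read length to the in.read(...) call that follows is what trips the HDFS client's bounds check shown above.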
      

      This has been reported at https://issues.streamsets.com/browse/SDC-2229. It also happens in Hive when very large text files are forced to be read in a single split (e.g. via the header-skipping feature, or via set mapred.min.split.size=9999999999999999).
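
      The issue was resolved as Fixed (see the attached MAPREDUCE-6635.patch). Purely as an illustration of a safe conversion, and not necessarily what the committed patch does, the guarded computation inside fillBuffer could instead take the minimum in long arithmetic before narrowing:

        int maxBytesToRead = buffer.length;
        if (totalBytesRead < splitLength) {
          // Take the min as longs first; the result is always within
          // [0, buffer.length], so the cast to int cannot overflow.
          maxBytesToRead = (int) Math.min((long) maxBytesToRead,
                                          splitLength - totalBytesRead);
        }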

    Attachments

      1. MAPREDUCE-6635.patch (5 kB, Junping Du)


    People

      Assignee: Junping Du (junping_du)
      Reporter: Sergey Shelukhin (sershe)
      Votes: 0
      Watchers: 9
