Hadoop Common / HADOOP-3647

Corner-case in IFile leads to failed tasks

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None

      Description

      A couple of reduce tasks failed at IFile.Reader.next, one with:

      java.lang.NegativeArraySizeException
         at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:246)
         at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:298)
         at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:134)
         at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:225)
         at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
         at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:720)
         at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:679)
         at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:225)
         at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2157)
      

      On a related note, another failed at:

      IFile.java:380
            // Position for the next record
            long skipped = dataIn.skip(recordLength);
            if (skipped != recordLength) {
              throw new IOException("Failed to skip past record of length: " + 
                                    recordLength);
            }
      

      where recordLength was -17.
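
Both failures are consistent with a negative length reaching the reader. The sketch below illustrates why, under stated assumptions: the record framing (a 4-byte length prefix), the class `RecordLengthCheck`, and all of its methods are hypothetical and are not taken from IFile.java. It shows that `skip()` on an in-memory stream reports 0 for a negative request, so a check like `skipped != recordLength` must fail for -17, and that validating the length up front turns a `NegativeArraySizeException` into a descriptive `IOException`.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

class RecordLengthCheck {
  // Hypothetical framing (an assumption, not IFile's real format):
  // a 4-byte big-endian length followed by the payload.
  static byte[] encode(int length, byte[] payload) {
    return ByteBuffer.allocate(4 + payload.length).putInt(length).put(payload).array();
  }

  // Reads one record and rejects a negative length before it can reach
  // `new byte[length]`, which would throw NegativeArraySizeException.
  static byte[] readRecord(DataInputStream in) throws IOException {
    int length = in.readInt();
    if (length < 0) {
      throw new IOException("Corrupt record: negative length " + length);
    }
    byte[] data = new byte[length];
    in.readFully(data);
    return data;
  }

  // What skip() reports for a request of n bytes on a 64-byte in-memory stream.
  static long skipResult(long n) {
    return new ByteArrayInputStream(new byte[64]).skip(n);
  }

  // Convenience wrapper that turns the outcome of readRecord into a string.
  static String describe(byte[] encoded) {
    try {
      byte[] data = readRecord(new DataInputStream(new ByteArrayInputStream(encoded)));
      return "ok: " + data.length + " bytes";
    } catch (IOException e) {
      return "error: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(skipResult(17));   // positive request within the stream
    System.out.println(skipResult(-17));  // negative request: nothing is skipped
    System.out.println(describe(encode(3, new byte[] {1, 2, 3})));
    System.out.println(describe(encode(-17, new byte[0])));
  }
}
```

Other `InputStream` implementations may treat a negative skip differently, so the sketch only explains the symptom; it says nothing about how the length became negative in the first place.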

      1. HADOOP-3647_0_20080708.patch
        17 kB
        Arun C Murthy
      2. HADOOP-3647_1_20080710.patch
        18 kB
        Arun C Murthy
      3. hadoop-logs.tar.gz
        132 kB
        Per Jacobsson

        Activity

        acmurthy Arun C Murthy added a comment -

        This has been observed only on sort500; I've tried in vain to reproduce it on smaller clusters...

        acmurthy Arun C Murthy added a comment -

        I've tried in vain to reproduce this over the past week, so here is a 'debug patch' to help us track it better in the future. I propose we commit it for now and keep an eye on the issue...

        acmurthy Arun C Murthy added a comment -

        Forgot to add that the patch has the following 'debug' properties:
        1. Logs the first key and value lengths for intermediate map-outputs on both the serving end (jetty) and the client end (the reducer's shuffle).
        2. Dumps the in-memory buffer to disk in case of the observed exceptions.

        Also, I've fixed a possibly subtle bug with flushing output-streams for compressed map-outputs in IFile.Writer.close().
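
The thread does not show the actual change to IFile.Writer.close(), so the following is only a generic illustration of why flushing matters for compressed output. It uses `java.util.zip.GZIPOutputStream` as a stand-in for Hadoop's compression codecs; the class `CompressedFlushSketch` and its methods are hypothetical. The point is that bytes written to a compressing stream may still sit in the compressor until the stream is finished, so the underlying output is incomplete if it is inspected or closed first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressedFlushSketch {
  // Writes data through a gzip stream and returns what has reached the sink.
  // With finish == false the sink is snapshotted before the compressor is
  // drained, mimicking a writer that forgets to finish its compressed stream.
  static byte[] compress(byte[] data, boolean finish) {
    try {
      ByteArrayOutputStream sink = new ByteArrayOutputStream();
      GZIPOutputStream gz = new GZIPOutputStream(sink);
      gz.write(data);
      if (finish) {
        gz.finish(); // drains the compressor and writes the gzip trailer
      }
      byte[] snapshot = sink.toByteArray();
      gz.close(); // release the compressor; the snapshot is already taken
      return snapshot;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // True only if the bytes decompress cleanly back to the expected data.
  static boolean roundTrips(byte[] compressed, byte[] expected) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[256];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return Arrays.equals(out.toByteArray(), expected);
    } catch (IOException e) {
      return false; // truncated or corrupt stream
    }
  }
}
```

A reader handed the unfinished bytes sees a truncated stream, which is the kind of downstream symptom an unflushed writer can produce. Whether that is related to the failures reported here is not established in this thread.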

        acmurthy Arun C Murthy added a comment -

        Resubmitting to hudson...

        devaraj Devaraj Das added a comment -

        I have two comments:
        1) The second seek in TaskTracker::doGet seems expensive, especially since it is only for debugging. It can be optimized to get the same info.
        2) The comment " // TODO: Remove this after a 'fix' for HADOOP-3647 // WARN: This won't work for compressed map-outputs!" should be removed from ReduceTask.java.

        acmurthy Arun C Murthy added a comment -

        Updated patch, this optimizes the seeks.

        devaraj Devaraj Das added a comment -

        +1

        acmurthy Arun C Murthy added a comment -

        Hudson seems to be in strife... I've verified that the patch passes all unit tests locally, and 'ant test-patch' doesn't balk either.

             [exec] -1 overall.
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
             [exec]                         Please justify why no tests are needed for this patch.
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
        

        This patch only adds some logging, so it doesn't merit a test case.

        acmurthy Arun C Murthy added a comment -

        I just committed this.

        hudson Hudson added a comment -

        Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

        pjacobsson Per Jacobsson added a comment -

        Attaching error logs from a failed run that could be related to this issue.


          People

          • Assignee:
            acmurthy Arun C Murthy
            Reporter:
            acmurthy Arun C Murthy
          • Votes:
            0
            Watchers:
            2
