Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-430

Task stuck in cleanup with OutOfMemoryErrors

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      Various code paths in the framework caught Throwable and tried to do inline cleanup. In case of OOM errors, such inline-cleanups can result into hung jvms. With this fix, the TaskTracker provides a api to report fatal errors (any throwable other than FSErrror and Exceptions). On catching a Throwable, Mapper/Reducer tries to inform the TT.
      Show
      Various code paths in the framework caught Throwable and tried to do inline cleanup. In case of OOM errors, such inline-cleanups can result into hung jvms. With this fix, the TaskTracker provides a api to report fatal errors (any throwable other than FSErrror and Exceptions). On catching a Throwable, Mapper/Reducer tries to inform the TT.

      Description

      Obesrved a task with OutOfMemory error, stuck in cleanup.

      1. MAPREDUCE-430-v1.8.patch
        8 kB
        Amar Kamat
      2. MAPREDUCE-430-v1.7.patch
        6 kB
        Amar Kamat
      3. MAPREDUCE-430-v1.6-branch-0.20.patch
        12 kB
        Amar Kamat
      4. MAPREDUCE-430-v1.6.patch
        12 kB
        Amar Kamat
      5. MAPREDUCE-430-v1.13-branch-0.20.patch
        18 kB
        Amar Kamat
      6. MAPREDUCE-430-v1.13.patch
        18 kB
        Amar Kamat
      7. MAPREDUCE-430-v1.12-branch-0.20.patch
        18 kB
        Amar Kamat
      8. MAPREDUCE-430-v1.12.patch
        18 kB
        Amar Kamat
      9. MAPREDUCE-430-v1.11.patch
        18 kB
        Amar Kamat
      10. MAPREDUCE-430.FI.patch
        0.6 kB
        Konstantin Boudnik
      11. MAPRED-430-v1.13-example.patch
        3 kB
        Amar Kamat

        Activity

        Amareshwari Sriramadasu created issue -
        Hide
        Amareshwari Sriramadasu added a comment -

        Task logs show :
        2009-06-12 21:37:11,574 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
        java.io.IOException: Task: attempt_200905250540_16139_r_000176_0 - The reduce copier failed
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:382)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
        Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
        at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
        at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:184)
        at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:376)
        at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:338)
        at org.apache.hadoop.mapred.Merger.merge(Merger.java:60)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2460)
        2009-06-12 21:37:11,597 INFO org.apache.hadoop.mapred.ReduceTask: Read 43883500 bytes from map-output for attempt_200905250540_16139_m_001504_0
        2009-06-12 21:37:11,696 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

        Similar to FSError, for any error Task should inform TT and get killed forcefully. Currently, Child catches Throwable and does cleanup, shutting log Manager etc. We should catch only Exception here. And Errors should be informed to TT.
        Thoughts?

        Show
        Amareshwari Sriramadasu added a comment - Task logs show : 2009-06-12 21:37:11,574 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Task: attempt_200905250540_16139_r_000176_0 - The reduce copier failed at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:382) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:184) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:376) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:338) at org.apache.hadoop.mapred.Merger.merge(Merger.java:60) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2460) 2009-06-12 21:37:11,597 INFO org.apache.hadoop.mapred.ReduceTask: Read 43883500 bytes from map-output for attempt_200905250540_16139_m_001504_0 2009-06-12 21:37:11,696 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task Similar to FSError, for any error Task should inform TT and get killed forcefully. Currently, Child catches Throwable and does cleanup, shutting log Manager etc. We should catch only Exception here. And Errors should be informed to TT. Thoughts?
        Owen O'Malley made changes -
        Field Original Value New Value
        Project Hadoop Common [ 12310240 ] Hadoop Map/Reduce [ 12310941 ]
        Key HADOOP-6088 MAPREDUCE-430
        Component/s mapred [ 12310690 ]
        Fix Version/s 0.21.0 [ 12313563 ]
        Hide
        Amar Kamat added a comment -

        Attaching a patch. Changes are as follows

        1. tasks report all the errors to the tasktracker that issues a kill and report to the jobtracker. jobtracker launches a cleanup attempt
        2. does inline cleaning for exceptions.

        Result of test-patch
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Running ant-tests now. Testing in progress.

        Show
        Amar Kamat added a comment - Attaching a patch. Changes are as follows tasks report all the errors to the tasktracker that issues a kill and report to the jobtracker. jobtracker launches a cleanup attempt does inline cleaning for exceptions . Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Running ant-tests now. Testing in progress.
        Amar Kamat made changes -
        Attachment MAPREDUCE-430-v1.6.patch [ 12412712 ]
        Hide
        Amar Kamat added a comment -

        TestStreamingExitStatus failed on contrib.

        Show
        Amar Kamat added a comment - TestStreamingExitStatus failed on contrib.
        Hide
        Amareshwari Sriramadasu added a comment -

        changes look fine to me

        Show
        Amareshwari Sriramadasu added a comment - changes look fine to me
        Hide
        Amar Kamat added a comment -

        MAPREDUCE-587 is opened for TestStreamingExitStatus failure.

        Show
        Amar Kamat added a comment - MAPREDUCE-587 is opened for TestStreamingExitStatus failure.
        Hide
        Amar Kamat added a comment -

        Attaching a patch for branch-0.20.

        Show
        Amar Kamat added a comment - Attaching a patch for branch-0.20.
        Amar Kamat made changes -
        Attachment MAPREDUCE-430-v1.6-branch-0.20.patch [ 12413283 ]
        Hide
        Sharad Agarwal added a comment -

        I just committed to trunk and 0.20 branch. Thanks Amar!

        Show
        Sharad Agarwal added a comment - I just committed to trunk and 0.20 branch. Thanks Amar!
        Sharad Agarwal made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Fix Version/s 0.20.1 [ 12314047 ]
        Resolution Fixed [ 1 ]
        Sharad Agarwal made changes -
        Assignee Amar Kamat [ amar_kamat ]
        Hide
        Arun C Murthy added a comment -

        Apologies for seeing this late, but:

        Did the task which threw up OOMError eventually time-out and was it declared as 'FAILED' ?


        -    } catch (FSError e) {
        -      LOG.fatal("FSError", e);
        +    } catch (Error e) {
        +      String error = "Error";
        +      if (e instanceof FSError) {
        +        error = "FSError";
        +      }
        

        This is a bad idiom, we can just catch FSError and Error separately, no?


        I also don't understand why we removed TaskUmbilicalProtocol.fsError and TaskUmbilicalProtocol.shuffleError in favour of a single TaskUmbilicalProtocol.taskError - having specific information about the reason for task failure is important, in future we should be using the specific information to drive decisions such as #retries for the task, tracker blacklisting etc. For e.g. errors such as OOM, process-tree running out of memory (high-ram jobs) shouldn't count against the TaskTracker. Hence, having more information about the specific error is good.

        Show
        Arun C Murthy added a comment - Apologies for seeing this late, but: Did the task which threw up OOMError eventually time-out and was it declared as 'FAILED' ? - } catch (FSError e) { - LOG.fatal( "FSError" , e); + } catch (Error e) { + String error = "Error" ; + if (e instanceof FSError) { + error = "FSError" ; + } This is a bad idiom, we can just catch FSError and Error separately, no? I also don't understand why we removed TaskUmbilicalProtocol.fsError and TaskUmbilicalProtocol.shuffleError in favour of a single TaskUmbilicalProtocol.taskError - having specific information about the reason for task failure is important, in future we should be using the specific information to drive decisions such as #retries for the task, tracker blacklisting etc. For e.g. errors such as OOM, process-tree running out of memory (high-ram jobs) shouldn't count against the TaskTracker. Hence, having more information about the specific error is good .
        Hide
        Devaraj Das added a comment -

        I agree with Arun. We should fix this patch.

        Show
        Devaraj Das added a comment - I agree with Arun. We should fix this patch.
        Hide
        Arun C Murthy added a comment -

        Did the task which threw up OOMError eventually time-out and was it declared as 'FAILED' ?

        Let me elaborate, I'm worried we have other bugs in the framework such as assigning the cleanup task to a jvm which is already dead and so on. I'd appreciate more details...

        Show
        Arun C Murthy added a comment - Did the task which threw up OOMError eventually time-out and was it declared as 'FAILED' ? Let me elaborate, I'm worried we have other bugs in the framework such as assigning the cleanup task to a jvm which is already dead and so on. I'd appreciate more details...
        Hide
        Amareshwari Sriramadasu added a comment -

        Did the task which threw up OOMError eventually time-out and was it declared as 'FAILED' ?

        No. The task alive and stuck. I think pingThread was sending updates without any problem.

        I'm worried we have other bugs in the framework such as assigning the cleanup task to a jvm which is already dead and so on.

        This would never happen, because task assignment happens when jvm comes back asking for a task using TaskUmbilicalProtocol.getTask

        Show
        Amareshwari Sriramadasu added a comment - Did the task which threw up OOMError eventually time-out and was it declared as 'FAILED' ? No. The task alive and stuck. I think pingThread was sending updates without any problem. I'm worried we have other bugs in the framework such as assigning the cleanup task to a jvm which is already dead and so on. This would never happen, because task assignment happens when jvm comes back asking for a task using TaskUmbilicalProtocol.getTask
        Hide
        Sharad Agarwal added a comment -

        Reverted the patch until the concerns are addressed.

        Show
        Sharad Agarwal added a comment - Reverted the patch until the concerns are addressed.
        Sharad Agarwal made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Hide
        Arun C Murthy added a comment -

        No. The task alive and stuck. I think pingThread was sending updates without any problem.

        That worries me. Do we know why the jvm did not shutdown? Where did we catch and ignore the OOMError? Is the ping thread not a daemon-thread?

        Show
        Arun C Murthy added a comment - No. The task alive and stuck. I think pingThread was sending updates without any problem. That worries me. Do we know why the jvm did not shutdown? Where did we catch and ignore the OOMError? Is the ping thread not a daemon-thread?
        Hide
        Amareshwari Sriramadasu added a comment -

        Where did we catch and ignore the OOMError?

        We did not ignore OOMError, it went up till Child.java. And child is trying to do cleanup before exiting, where it got stuck.

        Is the ping thread not a daemon-thread?

        No. It is daemon thread.

        Show
        Amareshwari Sriramadasu added a comment - Where did we catch and ignore the OOMError? We did not ignore OOMError, it went up till Child.java. And child is trying to do cleanup before exiting, where it got stuck. Is the ping thread not a daemon-thread? No. It is daemon thread.
        Hide
        Devaraj Das added a comment -

        I propose that we catch Exception instead of Throwable in Child.java. Whatever is currently done inside the catch Throwable block is retained (just that the block will get executed if an Exception is caught). That's the only change we do in the patch.

        Show
        Devaraj Das added a comment - I propose that we catch Exception instead of Throwable in Child.java. Whatever is currently done inside the catch Throwable block is retained (just that the block will get executed if an Exception is caught). That's the only change we do in the patch.
        Hide
        Amar Kamat added a comment -

        Attaching a patch that handles exceptions. Result of test-patch
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        All tests except TestRecoveryManager (timeout) and TestReduceFetchFromPartialMem (timeout) passed.

        Show
        Amar Kamat added a comment - Attaching a patch that handles exceptions. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. All tests except TestRecoveryManager (timeout) and TestReduceFetchFromPartialMem (timeout) passed.
        Amar Kamat made changes -
        Attachment MAPREDUCE-430-v1.7.patch [ 12416767 ]
        Amar Kamat made changes -
        Status Reopened [ 4 ] Patch Available [ 10002 ]
        Hadoop Flags [Reviewed]
        Hide
        Amareshwari Sriramadasu added a comment -

        catch block around task cleanup should also catch Exception, not Throwable.
        Do we need changes in TaskRunner?

        Show
        Amareshwari Sriramadasu added a comment - catch block around task cleanup should also catch Exception, not Throwable. Do we need changes in TaskRunner?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12416767/MAPREDUCE-430-v1.7.patch
        against trunk revision 805081.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12416767/MAPREDUCE-430-v1.7.patch against trunk revision 805081. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/488/console This message is automatically generated.
        Hide
        Amar Kamat added a comment -

        Attaching a patch that avoid finally being called upon errors. Result of test-patch
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Testing the patch.

        Show
        Amar Kamat added a comment - Attaching a patch that avoid finally being called upon errors. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Testing the patch.
        Amar Kamat made changes -
        Attachment MAPREDUCE-430-v1.8.patch [ 12416988 ]
        Hide
        Amar Kamat added a comment -

        ant tests (core + core) passed except TestJobHistory.

        Show
        Amar Kamat added a comment - ant tests (core + core) passed except TestJobHistory.
        Hide
        Amar Kamat added a comment -

        ant tests (core + core) passed except TestJobHistory.

        I meant core + contrib

        Show
        Amar Kamat added a comment - ant tests (core + core) passed except TestJobHistory. I meant core + contrib
        Hide
        Amar Kamat added a comment -

        Spoke to Devaraj about this offline. So child can ignore Errors and catch only Exceptions and FSError. But ReduceTask and MapTask still catch Throwable and eat it up. This can be serious and hence we need to take care of this. There are multiple options

        • dont catch Throwable only catch Exceptions
        • catch Throwable and exit the jvm
        • catch OOM and exit the jvm.
        Show
        Amar Kamat added a comment - Spoke to Devaraj about this offline. So child can ignore Errors and catch only Exceptions and FSError. But ReduceTask and MapTask still catch Throwable and eat it up. This can be serious and hence we need to take care of this. There are multiple options dont catch Throwable only catch Exceptions catch Throwable and exit the jvm catch OOM and exit the jvm.
        Hide
        Arun C Murthy added a comment -

        I had a chat with Amar to explain my take on this:

        1. {Map|Reduce}

          Task shouldn't catch Throwable except for the parts where they deal with threads e.g. copier threads or merge threads in which case they should use a Throwable variable to save the error and check for it.

        2. Child should catch Throwable, inform the TaskTracker via TaskUmbilicalProtocal.fatalError or such and then exit.

        Thoughts?

        Show
        Arun C Murthy added a comment - I had a chat with Amar to explain my take on this: {Map|Reduce} Task shouldn't catch Throwable except for the parts where they deal with threads e.g. copier threads or merge threads in which case they should use a Throwable variable to save the error and check for it. Child should catch Throwable, inform the TaskTracker via TaskUmbilicalProtocal.fatalError or such and then exit. Thoughts?
        Arun C Murthy made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        Arun C Murthy added a comment -

        A suggestion for a test-case for this bug:

        A job with 1 map and 1 reduce.

        {map|reduce}_0_0 should throw OOM, {map|reduce}

        _0_1 should throw FSError,

        {map|reduce}_0_2 should throw IOException and {map|reduce}

        _0_3 should succeed. We can probably add a check to ensure that the job completed under a couple of mins or so also.

        Show
        Arun C Murthy added a comment - A suggestion for a test-case for this bug: A job with 1 map and 1 reduce. {map|reduce}_0_0 should throw OOM, {map|reduce} _0_1 should throw FSError, {map|reduce}_0_2 should throw IOException and {map|reduce} _0_3 should succeed. We can probably add a check to ensure that the job completed under a couple of mins or so also.
        Hide
        Devaraj Das added a comment -

        I am worried that the communication to the TT might get stuck if it was a OOM. On the other hand, System.exit has a much better probability of making the process exit. So my thinking for Child.java is:

        try

        { // do the existing stuff }

        catch (OutOfMemoryError)

        { System.exit(-1); }

        catch (FSError fse)

        { // do the existing stuff }

        catch (Throwable t )

        { // do the existing stuff } finally { // do the existing stuff }
        Show
        Devaraj Das added a comment - I am worried that the communication to the TT might get stuck if it was a OOM. On the other hand, System.exit has a much better probability of making the process exit. So my thinking for Child.java is: try { // do the existing stuff } catch (OutOfMemoryError) { System.exit(-1); } catch (FSError fse) { // do the existing stuff } catch (Throwable t ) { // do the existing stuff } finally { // do the existing stuff }
        Hide
        Arun C Murthy added a comment -

        I'm thinking we can do:

        try {
          // existing stuff
        } catch (FSError e) {
         // ...
        } catch (Throwable t) {
         umblical.fatalError();
        }
        

        Thus we can allow the Child to exit if it can't do umbilical.fatalError(). I'm thinking that we'll just do a best-effort to call

        {umbilical.fatalError}

        , I don't think the jvm will get 'stuck' there. Thoughts?

        Show
        Arun C Murthy added a comment - I'm thinking we can do: try { // existing stuff } catch (FSError e) { // ... } catch (Throwable t) { umblical.fatalError(); } Thus we can allow the Child to exit if it can't do umbilical.fatalError(). I'm thinking that we'll just do a best-effort to call {umbilical.fatalError} , I don't think the jvm will get 'stuck' there. Thoughts?
        Hide
        Devaraj Das added a comment -

        +1. Let's keep a watch on situations where the communication gets stuck due to an OOM..

        Show
        Devaraj Das added a comment - +1. Let's keep a watch on situations where the communication gets stuck due to an OOM..
        Hide
        Arun C Murthy added a comment -

        Yep, also the worst case is that the TT will kill the stuck JVM after mapred.task.timeout. But, yes it would be good to watch for this.

        Show
        Arun C Murthy added a comment - Yep, also the worst case is that the TT will kill the stuck JVM after mapred.task.timeout. But, yes it would be good to watch for this.
        Hide
        Amar Kamat added a comment -

        Attaching a patch that does what was last discussed last.
        This is what the patch does :

        • tasktracker now provides fatalError() to report fatal errors from child
        • Child/ReduceTask/MapTask now catches Throwable and invokes umbilical.fatalError(). If this fails, then System.exit() is invoked.

        Result of test-patch
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Running ant-tests.

        Show
        Amar Kamat added a comment - Attaching a patch that does what was last discussed last. This is what the patch does : tasktracker now provides fatalError() to report fatal errors from child Child/ReduceTask/MapTask now catches Throwable and invokes umbilical.fatalError(). If this fails, then System.exit() is invoked. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Running ant-tests.
        Amar Kamat made changes -
        Attachment MAPREDUCE-430-v1.11.patch [ 12417451 ]
        Hide
        Amar Kamat added a comment -

        Attaching a new patch for review. Result of test-patch
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Show
        Amar Kamat added a comment - Attaching a new patch for review. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        Amar Kamat made changes -
        Attachment MAPREDUCE-430-v1.12.patch [ 12417476 ]
        Attachment MAPREDUCE-430-v1.12-branch-0.20.patch [ 12417477 ]
        Hide
        Amar Kamat added a comment -

        All tests (core + contrib) passed except TestReduceFetch which timed out.

        Show
        Amar Kamat added a comment - All tests (core + contrib) passed except TestReduceFetch which timed out.
        Hide
        Arun C Murthy added a comment -

        I've been doing some thinking about the 'right' approach for handling exceptions and errors in the map/reduce tasks and did bounce some of these through Chris too:

        1. Every code path in the tasks' should propagate the exception/error upwards after doing any necessary clean-up in it's own components and sub-components
        2. We should distinguish between user errors (OOM, IOException etc.) and systemic errors (FSError, ChecksumError etc.) and define just two methods on the TaskUmbilicalProtocol: userError and systemError. In future these should be used to blacklist nodes only on 'systemError', not on 'userError'.
        3. Child.java:main should be the only place we call the methods on TaskUmbilicalProtocol to inform the parent TaskTracker about errors. It should unwrap the caught exception and
        4. All threads (shuffle copier threads, merger threads, sort/spill threads etc.) should catch Throwable and save the exception for the 'main' thread to examine. The 'main' thread should examine these at all appropriate places and abort correctly.
        5. We should never rethrow exceptions from the 'main' threads - rather we should 'wrap' them in appropriate exceptions and throw them with the right initCause. This is so that we don't lose the original stack traces.
        6. We should strive to use the same 'exception' types for the 'wrapper exceptions' whenever the exception is part of the signature e.g. IOException for map/reduce in the old api and IOException and InterruptedException for map/reduce in the new api (it is highly unfortunate that the RPC layer wraps InterruptedException in an IOException today! ). This is very important since the application writer might be relying on the 'right' exception for his specific error-handling needs. Thus we should wrap IOException/InterruptedException in an IOException and other Exceptions/Errors in a RuntimeException.

        Thoughts?

        Show
        Arun C Murthy added a comment - I've been doing some thinking about the 'right' approach for handling exceptions and errors in the map/reduce tasks and did bounce some of these through Chris too: Every code path in the tasks' should propagate the exception/error upwards after doing any necessary clean-up in it's own components and sub-components We should distinguish between user errors (OOM, IOException etc.) and systemic errors (FSError, ChecksumError etc.) and define just two methods on the TaskUmbilicalProtocol: userError and systemError. In future these should be used to blacklist nodes only on 'systemError', not on 'userError'. Child.java:main should be the only place we call the methods on TaskUmbilicalProtocol to inform the parent TaskTracker about errors. It should unwrap the caught exception and All threads (shuffle copier threads, merger threads, sort/spill threads etc.) should catch Throwable and save the exception for the 'main' thread to examine. The 'main' thread should examine these at all appropriate places and abort correctly. We should never rethrow exceptions from the 'main' threads - rather we should 'wrap' them in appropriate exceptions and throw them with the right initCause . This is so that we don't lose the original stack traces. We should strive to use the same 'exception' types for the 'wrapper exceptions' whenever the exception is part of the signature e.g. IOException for map/reduce in the old api and IOException and InterruptedException for map/reduce in the new api (it is highly unfortunate that the RPC layer wraps InterruptedException in an IOException today! ). This is very important since the application writer might be relying on the 'right' exception for his specific error-handling needs. Thus we should wrap IOException/InterruptedException in an IOException and other Exceptions/Errors in a RuntimeException. Thoughts?
        Hide
        Devaraj Das added a comment -

        I'd be happier with the previous approach for 0.20. I believe it doesn't break the existing behavior (I will look at the last patch in more detail today). This seems like a good first step towards a probably better fix for the issue as is proposed in the last comment. The last proposal seems like a much bigger change and I propose that we thrash that design/implementation out as part of 0.21. Thoughts?

        Show
        Devaraj Das added a comment - I'd be happier with the previous approach for 0.20. I believe it doesn't break the existing behavior (I will look at the last patch in more detail today). This seems like a good first step towards a probably better fix for the issue as is proposed in the last comment. The last proposal seems like a much bigger change and I propose that we thrash that design/implementation out as part of 0.21. Thoughts?
        Hide
        Devaraj Das added a comment -

        The code

                 LOG.fatal(STRING, throwable);
                 try {
                   umbilical.fatalError(getTaskID(), t);
                 } catch (IOException ioe) {
                   LOG.fatal("Failed to contact the tasktracker", t);
                   System.exit(-1);
                 }
        

        can be factored out to a method.. Other than that patch looks okay..

        Show
        Devaraj Das added a comment - The code LOG.fatal(STRING, throwable); try { umbilical.fatalError(getTaskID(), t); } catch (IOException ioe) { LOG.fatal("Failed to contact the tasktracker", t); System.exit(-1); } can be factored out to a method.. Other than that patch looks okay..
        Hide
        Aaron Kimball added a comment -

        Arun,

        Can you elaborate on the difference between "user" and "system" errors? I can imagine IOException getting thrown from within Hadoop internals just as much as from within a user's map() method. The distinction you mentioned above seems somewhat arbitrary to me at first glance.

        Show
        Aaron Kimball added a comment - Arun, Can you elaborate on the difference between "user" and "system" errors? I can imagine IOException getting thrown from within Hadoop internals just as much as from within a user's map() method. The distinction you mentioned above seems somewhat arbitrary to me at first glance.
        Hide
        Arun C Murthy added a comment -

        Aaron - I apologize for being unclear and assuming people got the context.

        The distinction mainly arises from the desire to weed out node/hardware failures as opposed to application errors. The the distinction is rooted in the desire to treat node/hardware errors (disk, corrupt RAM/NIC etc.) differently so as to quickly and accurately penalize the node (e.g. blacklist the tasktracker). Currently we use all task failures uniformly to penalize the tasktracker... clearly penalizing a tasktracker for an application error which results in an OOM (for e.g.) is injudicious. Does that make sense?

        Show
        Arun C Murthy added a comment - Aaron - I apologize for being unclear and assuming people got the context. The distinction mainly arises from the desire to weed out node/hardware failures as opposed to application errors. The the distinction is rooted in the desire to treat node/hardware errors (disk, corrupt RAM/NIC etc.) differently so as to quickly and accurately penalize the node (e.g. blacklist the tasktracker). Currently we use all task failures uniformly to penalize the tasktracker... clearly penalizing a tasktracker for an application error which results in an OOM (for e.g.) is injudicious. Does that make sense?
        Hide
        Arun C Murthy added a comment -

        I guess we should continue to debate the semantics of exception-handling in a separate jira - and get this in as an emergency fix.

        Wrt the patch, I'd be happier to with 2 changes:

        1. Don't rethrow exceptions:
        2. 'Unwrap' to get the 'cause' in Child.java (http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Throwable.html#getCause%28%29).

        Thoughts?

        Show
        Arun C Murthy added a comment - I guess we should continue to debate the semantics of exception-handling in a separate jira - and get this in as an emergency fix. Wrt the patch, I'd be happier to with 2 changes: Don't rethrow exceptions: 'Unwrap' to get the 'cause' in Child.java ( http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Throwable.html#getCause%28%29 ). Thoughts?
        Hide
        Arun C Murthy added a comment -

        To clarify: both catch clauses (Exception and Throwable) should do the unwrap via getCause() and use that if it's not null to report to TT, log etc.

        Show
        Arun C Murthy added a comment - To clarify: both catch clauses (Exception and Throwable) should do the unwrap via getCause() and use that if it's not null to report to TT, log etc.
        Hide
        Arun C Murthy added a comment -
        +  /** Report that the task encounted a fatal error.*/
        +  void fatalError(TaskAttemptID taskId, Throwable throwable) throws IOException;
        +  
        

        is wrong - it should be:

        +  /** Report that the task encounted a fatal error.*/
        +  void fatalError(TaskAttemptID taskId, String message) throws IOException;
        +  
        
        Show
        Arun C Murthy added a comment - + /** Report that the task encounted a fatal error.*/ + void fatalError(TaskAttemptID taskId, Throwable throwable) throws IOException; + is wrong - it should be: + /** Report that the task encounted a fatal error.*/ + void fatalError(TaskAttemptID taskId, String message) throws IOException; +
        Hide
        Amar Kamat added a comment -

        Attaching a patch that address Devaraj's and Arun's concerns. Result of test-patch
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Testing in progress.

        Show
        Amar Kamat added a comment - Attaching a patch that address Devaraj's and Arun's concerns. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Testing in progress.
        Amar Kamat made changes -
        Attachment MAPREDUCE-430-v1.13.patch [ 12417714 ]
        Hide
        Amar Kamat added a comment -

        All tests except TestRecoveryManager (timeout) passed. Manually testing the patch.

        Show
        Amar Kamat added a comment - All tests except TestRecoveryManager (timeout) passed. Manually testing the patch.
        Hide
        Amar Kamat added a comment -

        Manually tested the patch by throwing errors in Child.java and ReduceTask#GetMapEventsThread. In both the cases the tasktracker is informed about the failure. Even TaskInProgress is informed about the error.

        Show
        Amar Kamat added a comment - Manually tested the patch by throwing errors in Child.java and ReduceTask#GetMapEventsThread. In both the cases the tasktracker is informed about the failure. Even TaskInProgress is informed about the error.
        Hide
        Amar Kamat added a comment -

        Also added sleep in testing to check if the tasktracker really kills the child process.

        Show
        Amar Kamat added a comment - Also added sleep in testing to check if the tasktracker really kills the child process.
        Hide
        Amar Kamat added a comment -

        Attaching test-patch results for 20
        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] -1 Eclipse classpath. The patch causes the Eclipse classpath to differ from the contents of the lib directories.

        -1 Eclipse classpath is related to HADOOP-6213. Running ant tests.

        Show
        Amar Kamat added a comment - Attaching test-patch results for 20 [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 Eclipse classpath. The patch causes the Eclipse classpath to differ from the contents of the lib directories. -1 Eclipse classpath is related to HADOOP-6213 . Running ant tests.
        Hide
        Amar Kamat added a comment -

        Attaching a patch for branch 0.20. Running ant tests.

        Show
        Amar Kamat added a comment - Attaching a patch for branch 0.20. Running ant tests.
        Amar Kamat made changes -
        Attachment MAPREDUCE-430-v1.13-branch-0.20.patch [ 12417729 ]
        Hide
        Amar Kamat added a comment -

        All ant tests for branch 0.20 passed except
        core : TestDistributedFileSystem and TestReduceFetch
        contrib : TestStreamingExitStatus, TestJobInitialization and TestQueueCapacities

        Show
        Amar Kamat added a comment - All ant tests for branch 0.20 passed except core : TestDistributedFileSystem and TestReduceFetch contrib : TestStreamingExitStatus, TestJobInitialization and TestQueueCapacities
        Hide
        Sharad Agarwal added a comment -

        I just committed to trunk and 0.20 branch. Thanks Amar!

        Show
        Sharad Agarwal added a comment - I just committed to trunk and 0.20 branch. Thanks Amar!
        Sharad Agarwal made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Hide
        Sharad Agarwal added a comment -

        Fixed a minor typo while committing to trunk in Task.java

        } catch (IOException ioe) {
              LOG.fatal("Failed to contact the tasktracker", throwable);
        }
        
        

        to

        } catch (IOException ioe) {
              LOG.fatal("Failed to contact the tasktracker", ioe);
        }
        
        
        Show
        Sharad Agarwal added a comment - Fixed a minor typo while committing to trunk in Task.java } catch (IOException ioe) { LOG.fatal( "Failed to contact the tasktracker" , throwable); } to } catch (IOException ioe) { LOG.fatal( "Failed to contact the tasktracker" , ioe); }
        Hide
        Amar Kamat added a comment -

        Attaching a patch for Yahoo Hadoop distribution. Example patch not to be committed.

        Show
        Amar Kamat added a comment - Attaching a patch for Yahoo Hadoop distribution. Example patch not to be committed.
        Amar Kamat made changes -
        Attachment MAPRED-430-v1.13-example.patch [ 12417751 ]
        Amar Kamat made changes -
        Release Note Various code paths in the framework caught Throwable and tried to do inline cleanup. In case of OOM errors, such inline-cleanups can result into hung jvms. With this fix, the TaskTracker provides a api to report fatal errors (any throwable other than FSErrror and Exceptions). On catching a Throwable, Mapper/Reducer tries to inform the TT.
        Hide
        Konstantin Boudnik added a comment -

        I had prepared this short fault injection patch as a part of my presentation for MR team. It can be used as a regression test for the JIRA to guarantee OOM in the particular spot of the application when is needed.

        Please note, that the patch require fault injection framework which is coming in HADOOP-6204 and can be reused my MapReduce.

        Show
        Konstantin Boudnik added a comment - I had prepared this short fault injection patch as a part of my presentation for MR team. It can be used as a regression test for the JIRA to guarantee OOM in the particular spot of the application when is needed. Please note, that the patch require fault injection framework which is coming in HADOOP-6204 and can be reused my MapReduce.
        Konstantin Boudnik made changes -
        Attachment MAPREDUCE-430.FI.patch [ 12417893 ]

          People

          • Assignee:
            Amar Kamat
            Reporter:
            Amareshwari Sriramadasu
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development