Hadoop Map/Reduce
MAPREDUCE-1119

When tasks fail to report status, show task's stack dump before killing

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.21.0
    • Component/s: tasktracker
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      When the TT kills tasks that haven't reported status, it should somehow gather a stack dump for the task. This could be done either by sending a SIGQUIT (so the dump ends up in stdout) or perhaps by something like JDI to gather the stack directly from Java. This may be somewhat tricky since the child may be running as another user (so the SIGQUIT would have to go through LinuxTaskController). This feature would make debugging these kinds of failures much easier, especially if we could somehow get it into the TaskDiagnostic message.
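The SIGQUIT route mentioned above can be sketched in a few lines. This is a hypothetical illustration, not the patch's code: a HotSpot JVM responds to SIGQUIT by printing all thread stacks to its own stdout and then keeps running, so a tracker-side helper only needs to deliver the signal.

```java
import java.io.IOException;

public class SigQuitSketch {
    /**
     * Ask the JVM at the given pid for a thread dump. HotSpot reacts to
     * SIGQUIT by writing every thread's stack to its own stdout (which
     * the TT already captures as task logs). Returns kill's exit code.
     */
    static int requestThreadDump(long pid)
            throws IOException, InterruptedException {
        return new ProcessBuilder("kill", "-QUIT", Long.toString(pid))
                .inheritIO()
                .start()
                .waitFor();
    }
}
```

In the multi-user case described above, the same signal would have to be delivered through a setuid helper (LinuxTaskController) rather than a plain kill from the TT's user.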

      1. MAPREDUCE-1119.2.patch
        13 kB
        Aaron Kimball
      2. MAPREDUCE-1119.3.patch
        26 kB
        Aaron Kimball
      3. MAPREDUCE-1119.4.patch
        27 kB
        Aaron Kimball
      4. MAPREDUCE-1119.5.patch
        26 kB
        Aaron Kimball
      5. MAPREDUCE-1119.6.patch
        31 kB
        Aaron Kimball
      6. MAPREDUCE-1119.patch
        13 kB
        Aaron Kimball


          Activity

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #164 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/164/)
          . When tasks fail to report status, show tasks's stack dump before killing. Contributed by Aaron Kimball.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #143 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/143/)
          . When tasks fail to report status, show tasks's stack dump before killing. Contributed by Aaron Kimball.

          Tom White added a comment -

          I've just committed this. Thanks Aaron!

          Vinod Kumar Vavilapalli added a comment -

          The contrib test failure is related to MAPREDUCE-1124. This patch is good for check-in.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12426315/MAPREDUCE-1119.6.patch
          against trunk revision 886059.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 17 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/160/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/160/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/160/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/160/console

          This message is automatically generated.

          Vinod Kumar Vavilapalli added a comment -

          I am not very sure we want to use the same configuration property for sleeping after dump-stack

          So I think the delay is necessary. As for adding a new config item.. maybe wait til someone makes an issue of this? Is there any performance implication here?

          Given that the delay is necessary, it is only about the config item - new or the same one. Thinking a bit about it, I am fine with keeping the same configuration because that kind of gives the same time for the thread-dump as the time that is given for the process to clean itself up. I know it isn't a very concrete reason, but it works for me for now.

          Gone through your latest patch. It looks good now. +1 from my side.

          I'll run it through Hudson once again to be sure about the test-cases. Maybe the TestGridmixSubmission and TestJobHistory.testDoneFolderOnHDFS() failures are timing related.

          After Hudson blesses, can you ask someone to commit this? Thanks for being patient throughout!

          Aaron Kimball added a comment -
          • GridMix has been sporadically failing for a while now in Hudson, but not in a deterministic fashion. Cannot reproduce locally.
          • The fair scheduler test times out on both my trunk and MAPREDUCE-1119 branches locally. MAPREDUCE-1245?
          • TestJobHistory passes locally both on my trunk and MAPREDUCE-1119 branches.
          Arun C Murthy added a comment -

          Which of these tests fail without our patch i.e. on a clean trunk?

          Aaron Kimball added a comment -

          GridMix and FairScheduler test failures are definitely unrelated. I ran TestJobHistory locally and it passes – also unrelated, I think?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12426315/MAPREDUCE-1119.6.patch
          against trunk revision 884832.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 17 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/277/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/277/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/277/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/277/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          Attaching a new patch. It incorporates the code review suggestions above. DefaultTaskController is tested by adding a TaskController subclass for testing; it counts the number of times dumpTaskStack() is called and ensures the counter is incremented only during the appropriate jobs. The same strategy is employed for testing LinuxTaskController: ClusterWithLinuxTaskController.MyLinuxTaskController now counts SIGQUIT calls as well as any exceptional exit statuses from task-controller when administering the SIGQUIT to the client. Also improved ClusterWithLinuxTaskController's documentation on setting up the test case.

          All of these tests pass on my local machine.

          Aaron Kimball added a comment -

          To address your various points:

          src/c++/task-controller/main.c

          The program attempts to deduce its configuration directory based on argv[0]. Given that task-controller should be in $HADOOP_HOME/bin/, it will do some surgery on the program name to remove "bin/task-controller" and add the "conf" suffix, giving $HADOOP_HOME/conf. Unfortunately, this presumes that argv[0] is the absolute path to task-controller. If it's not, then this will result in a negative offset being calculated – and it will attempt to malloc a negative amount of space, which causes a SIGSEGV. This checks that task-controller was invoked with argv[0] set to the absolute path to task-controller, and not just something like './task-controller'

          Hadoop may invoke task-controller via its absolute path, but when I was unit testing it, I just ran './task-controller', which was causing it to segfault. This should exit politely instead.
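The guard Aaron describes can be sketched in Java terms (the real check lives in C in src/c++/task-controller/main.c; the names here are illustrative): with a relative argv[0], the computed configuration-directory length goes negative, so the fix is to reject non-absolute paths up front.

```java
public class ConfDirSketch {
    static final String SUFFIX = "bin/task-controller";

    /**
     * Deduce $HADOOP_HOME/conf from the binary's own path, as main.c
     * does. Returns null when argv[0] is not an absolute path ending in
     * bin/task-controller: in the C code that case produced a negative
     * length, a malloc of a negative size, and a SIGSEGV; exiting
     * politely is the fix.
     */
    static String deduceConfDir(String argv0) {
        if (!argv0.startsWith("/") || !argv0.endsWith(SUFFIX)) {
            return null; // e.g. "./task-controller" - refuse instead of crashing
        }
        return argv0.substring(0, argv0.length() - SUFFIX.length()) + "conf";
    }
}
```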

          Now that the semantics have changed, I am not very sure we want to use the same configuration property for sleeping after dump-stack. (Thinking aloud..) Do we even need a sleep here? The signalling order is SIGQUIT->SIGTERM->SIGKILL. Will signals be processed in the order of their arrival? If so, then we will not need another sleep. If not, we may need a sleep here, but it may or may not be driven by the same config item. What do you think?

          I'm not honestly sure. I believe that the operating system will raise the signals to the process in order. However, what it means to "handle" a signal is ambiguous. It may be the case that the signal handler for SIGQUIT merely sets a flag which tells another thread to print the stack dumps to stdout at some other convenient interval (e.g., during a gc when the world is stopped). So while SIGQUIT may be received by the JVM before SIGTERM, I don't take that as a guarantee that the data will reach stdout when we want it to. So I think the delay is necessary. As for adding a new config item.. maybe wait til someone makes an issue of this? Is there any performance implication here?
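The kill sequence under discussion, with the grace period Aaron argues for, might look like the following. This is an illustrative sketch, not the patch itself; the delay value is arbitrary and would come from configuration.

```java
public class KillSequenceSketch {
    /**
     * SIGQUIT to trigger a thread dump, a grace period so the dump can
     * actually reach stdout, then SIGTERM and finally SIGKILL. Even if
     * the OS delivers the signals in order, the JVM may only print the
     * dump at some later convenient point, hence the sleeps.
     */
    static void killWithStackDump(long pid, long graceMillis)
            throws Exception {
        String p = Long.toString(pid);
        new ProcessBuilder("kill", "-QUIT", p).start().waitFor();
        Thread.sleep(graceMillis); // let the JVM flush the dump before it dies
        new ProcessBuilder("kill", "-TERM", p).start().waitFor();
        Thread.sleep(graceMillis); // grace period before the hard kill
        new ProcessBuilder("kill", "-KILL", p).start().waitFor();
    }
}
```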

          Are we sure "PSPermGen" will always be there in the dump? Instead how about passing our own TaskController that does custom actions in TaskController.dumpStacks(), simplifying our verification that dump-stack is indeed called?

          In the Sun JVM, the stack dump messages always end with a description of the heap, which includes this key string. I doubt this is standardized; JRockit or another JVM will probably use some other string. Testing with a MockTaskController is probably a good idea here.
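The MockTaskController idea can be sketched as a counting subclass. TaskController is heavily simplified here; the real class has a much larger abstract surface.

```java
// Simplified stand-in for org.apache.hadoop.mapred.TaskController.
abstract class TaskController {
    abstract void dumpTaskStack(String taskPid);
}

// Test double: instead of grepping task logs for a JVM-specific marker
// like "PSPermGen", the test counts dumpTaskStack() invocations and
// asserts the counter moved only for the jobs that should time out.
class CountingTaskController extends TaskController {
    int dumpCalls = 0;

    @Override
    void dumpTaskStack(String taskPid) {
        dumpCalls++; // record the call; the test asserts on this counter
    }
}
```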

          The test-time can be more than halved if we set max-map-attempts to one in both the tests via conf.setMaxMapAttempts(1);

          Good observation. Will do.

          We need a similar test for LinuxTaskController to test stack-dump when multiple users are involved.

          I'll look into this.

          I'll also handle the other more minor suggestions you made as well.
          Thanks for the review!

          Vinod Kumar Vavilapalli added a comment -

          The patch looks very clean now! Thanks! It is very close, I have only few comments on the latest patch, most of them minor:

          • Care to explain the changes to src/c++/task-controller/main.c w.r.t. conf_dir_len? Both for my confirmation as well as for the record.
          • Change the C comments for kill_user_task() in src/c++/task-controller/task-controller.c to mention that it can terminate/kill or dump-stack?
          • Now that the semantics have changed, I am not very sure we want to use the same configuration property for sleeping after dump-stack. (Thinking aloud..) Do we even need a sleep here? The signalling order is SIGQUIT->SIGTERM->SIGKILL. Will signals be processed in the order of their arrival? If so, then we will not need another sleep. If not, we may need a sleep here, but it may or may not be driven by the same config item. What do you think?
          • All the three newly added methods in JvmManager can be package-private or private.
          • ProcessTree.java:
            • Lot of refactoring. Nice!
            • The variables SIG* and SIG*_STR can all be private, so can maybeSignalProcess() and maybeSignalProcessGroup() be.
          • TestJobKillAndFail
            • Are we sure "PSPermGen" will always be there in the dump? Instead how about passing our own TaskController that does custom actions in TaskController.dumpStacks(), simplifying our verification that dump-stack is indeed called?
            • The test now takes very long time. The test-time can be more than halved if we set max-map-attempts to one in both the tests via conf.setMaxMapAttempts(1);
          • We need a similar test for LinuxTaskController to test stack-dump when multiple users are involved. You can look at TestLocalizationWithLinuxTaskController and/or TestJobExecutionAsDifferentUser for inspiration.
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12424806/MAPREDUCE-1119.5.patch
          against trunk revision 835237.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/241/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/241/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/241/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/241/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          I agree that making stack dumps an explicit operation rather than conflating it with killing child JVMs is cleaner logic. Here's a patch which provides the mechanism to do so.

          Note that with this patch, the SIGTERM is back in, making the kill() operation act like it did before.

          Vinod Kumar Vavilapalli added a comment -

          (ie only do so in the case of an unresponsive task)

          That is exactly what I intended. Let's just dump it only when TT calls purgeTask() in markUnresponsiveTasks() to kill an unresponsive task. It simply avoids a lot of other changes the last patch includes.

          With this, we will have the feature to dump stacks in place. We can use it in future if we wish to do the same with other cases.

          Aaron Kimball added a comment -

          Todd, I think it's me who misunderstood. I interpreted Vinod's comment entirely the opposite way, but it actually makes much more sense to handle it this way. Vinod, care to clarify your intent?

          Assuming you mean that we should never dump stack on exception, that seems like a reasonable approach. I'll see what I can throw together.

          Todd Lipcon added a comment -

          Maybe I misunderstood, but I thought Vinod's suggestion was to "remove the uncertainty" by making sure it never dumps stack in the case of a child error (ie only do so in the case of an unresponsive task). That is what makes most sense to me, since the child error itself is already setting stack traces in diagnostic info, etc.

          Aaron Kimball added a comment -

          Vinod,

          I'm not sure what markUnresponsiveTasks() has to do with anything; that's not called from either fsError() or fatalError(), which are how the child signals an exception back to the TT.

          Even still, I do not believe that calling dumpTaskStack() from within fsError/fatalError solves the problem either. Signals are delivered asynchronously. Even if the signal were sent inside of fsError() before its call to purgeTask(), there is an arbitrary amount of time before that is handled by the child program. (Not to mention that purgeTask()'s logic in this patch is 'send SIGQUIT, wait, then SIGKILL' – so doing the SIGQUIT before purgeTask() does not change the effective serialization.) Furthermore, the act of printing the stack trace to stderr itself takes an arbitrary amount of time. The child program after calling TaskUmbilicalProtocol.fsError() or fatalError() will still immediately exit all threads; there is no guarantee that it will service the signal before terminating naturally.

          Based on the logic in org.apache.hadoop.mapred.Child, it is unlikely that the purgeTask() associated with TUP.fsError()/fatalError() is actually killing the task; it's probably exiting of its own accord first (otherwise, we'd be seeing the stack dumps since the SIGQUIT is serialized before the SIGKILL as-is). So I don't think this removes any nondeterminism from the equation. If we want child exceptions to definitely trigger a stack dump in the event of an exception, then the child should enumerate its own threads, call Thread.getStackTrace() on them, and print them to stderr itself to ensure that this process completes before exiting the JVM, without relying on asynchronous signals.
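The in-process alternative Aaron sketches at the end, with the child enumerating its own threads rather than relying on asynchronous SIGQUIT delivery, could look like this (an illustrative sketch, not the patch's code):

```java
import java.util.Map;

public class ThreadDumpSketch {
    /**
     * Print every live thread's stack to stderr from inside the process.
     * Because this runs synchronously before the JVM exits, the dump is
     * guaranteed to be written, unlike a SIGQUIT that may race with
     * normal process termination.
     */
    static void dumpOwnStacks() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            System.err.println("Thread: " + e.getKey().getName());
            for (StackTraceElement frame : e.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }
}
```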

          Vinod Kumar Vavilapalli added a comment -

          We can avoid the uncertainty around dumping the stack in the case of a Child exception by directly calling taskController.dumpTaskStack(context) from inside TaskTracker.markUnresponsiveTasks(), immediately before purgeTask(tip, true). This will create the dump only when it is absolutely needed.

          Would that work? To construct the context, you may need a bridging method inside JvmManager which can itself call taskController.dumpTaskStack(context).

          That will make the patch a lot simpler, and will avoid many other changes to JvmManager/TaskController. If you wish, I can upload a demonstration patch. Thoughts?

          Aaron Kimball added a comment -

          Error is in gridmix contrib.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12424659/MAPREDUCE-1119.4.patch
          against trunk revision 834284.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/238/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/238/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/238/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/238/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          +1 pending Hudson QA results from the last patch (don't see why it wouldn't be fine given it's just javadoc changes since last one)

          Aaron Kimball added a comment -

          I do like my fractional bonus points. Here's a new patch with a more thorough javadoc on TaskController.destroyTaskJVM()

          Todd Lipcon added a comment -

          Hey Aaron,

          Thanks for doing the investigation to fill in the table.

          The behavior as you've described it all seems pretty reasonable - I wish the "maybe" were "false", but it sounds like it does require a pretty significant overhaul of failure tracking throughout the TT/TaskManager/etc, so I'd be inclined to say it's out of scope.

          So, +0.5 from me on the current patch. Full +1 if some of the above explanation could be transformed into javadoc on the wasFailure parameter - probably just the full explanation where it first appears, and then a "@see FirstAppearanceClass.someMethodName" for that parameter elsewhere, so we don't have duplication of the explanation.

          Aaron Kimball added a comment -

          I modeled this parameter after the fact that TaskTracker was already using this same name (see setTaskFailState(), jobHasFinished(), kill(), cleanUpOverMemoryTask()) to indicate whether a kill was failure-based or for other purposes (cleanup/preemption/etc).

          I think there is perhaps a more systemic overhaul of failure reason tracking that should occur as a separate issue?

          As for your table... if you look at TaskController.destroyTaskJVM() (line 151), you can see that generates-stack is true iff wasFailure is true.

          I ran some tests by running sleep job which slept for 60 seconds in each call to map(). Results follow:

          Test case                                          | Stack dump?
          set mapreduce.task.timeout to 10000 (task timeout) | yes
          ran bin/mapred job -kill-task on attempts          | no
          ran bin/mapred job -fail-task on attempts          | no
          Let it complete successfully                       | no
          ran bin/mapred job -kill on the job itself         | no
          threw a RuntimeException in the mapper             | no

          Thus, I believe that translates into the following for your table:

          Reason              | wasFailure | generates stack
          Child exception     | maybe      | maybe*
          Other task failures | false      | false
          Task timeout        | true       | true
          Task killed by user | false      | false
          Task failed by user | false      | false
          Job killed by user  | false      | false

          Looking at org.apache.hadoop.mapred.Child, there are a few different catch blocks in there:

          • If a task throws a FSError, this triggers TaskUmbilicalProtocol.fsError(), which will cause a purgeTask(tip, wasFailure=true).
          • If a task throws any other sort of Exception, this does not trigger a particular response to the TUP; the exception string is passed to TaskTracker.reportDiagnosticInfo(), but this simply logs a string of text and takes no further action.
          • If a map task throws any other Throwable, this triggers TUP.fatalError(), which also calls purgeTask(tip, wasFailure=true).

          But immediately after these catch blocks, it closes the RPC proxy, shuts down the logging thread, and exits the JVM. So fsError and fatalError may cause a stack dump if the TT processes the request fast enough and issues a SIGQUIT in the next few microseconds. But this is racing against the fact that the child task's next action is "exit immediately."

          Note that job kill, task timeout, and the task exception cases are all covered in the unit test provided in this patch.

          Todd Lipcon added a comment -

          New patch looks better. A couple changes before I think it's ready for commit:

          • The "wasFailure" boolean isn't terribly clear. I'd like to either see some more javadoc "@param" tags, or replace it with a KillReason enum. Basically I want to be able to fill in this table:
            Reason                                 | wasFailure | generates stack
            Child threw exception                  | ?          | ?
            Job failed due to other tasks failing  | ?          | ?
            Task timed out                         | ?          | ?
            Task killed by user (incl. preemption) | ?          | ?
            Job killed by user                     | ?          | ?

          Perhaps I'm just being dense, but the above isn't easy to understand from the existing code. If this was a "preexisting condition" (sorry!) then maybe it's out of scope for this JIRA.

          Aaron Kimball added a comment -

          Seems as though the release audit warning checker is broken...

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12424559/MAPREDUCE-1119.3.patch
          against trunk revision 834284.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 159 release audit warnings (more than the trunk's current 0 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/132/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          Attaching new patch to address code review issues.

          • Renamed QUIT_TASK_JVM to SIGQUIT_TASK_JVM
          • Task timeout causes SIGQUIT; other task kill events do not.
            • modified the various calls in the call chain for task kill to pass along a wasFailure bit
            • modified all associated call-sites to forward along existing wasFailure bit, or generate a new true or false as appropriate.
            • modified TestJobKillAndFail to distinguish between job kill and task timeout failure conditions and whether or not those deserved stack dumps.
          • If a SIGQUIT is issued before a SIGKILL, no SIGTERM is sent in between.
          • Refactored common code in ProcessTree
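          The SIGQUIT-then-SIGKILL ordering described above can be sketched roughly like this. It is a simplified stand-in: the actual patch routes these signals through TaskController/ProcessTree, and the method and parameter names here are illustrative only:

```java
import java.io.IOException;

public class StackDumpKiller {

    /**
     * Dump the task JVM's thread stacks, wait, then kill it outright.
     * SIGQUIT makes a hung JVM print all thread stacks to its stdout;
     * SIGTERM is deliberately skipped, since the task is already known
     * to be hung and is being forcibly terminated.
     */
    public static void dumpStackAndKill(long pid, long sleepTimeBeforeSigkill)
            throws IOException, InterruptedException {
        // SIGQUIT: the JVM's default handler prints a full thread dump.
        new ProcessBuilder("kill", "-QUIT", Long.toString(pid)).start().waitFor();
        // Grace period so the dump can be written before the process dies.
        Thread.sleep(sleepTimeBeforeSigkill);
        // SIGKILL: forcibly terminate; no SIGTERM in between.
        new ProcessBuilder("kill", "-KILL", Long.toString(pid)).start().waitFor();
    }
}
```

          Reusing the existing sleepTimeBeforeSigkill delay between the two signals, as the patch does, avoids introducing a second timeout configuration.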
          Aaron Kimball added a comment -

          Seems so; TaskRunner and JvmManager are both package-visible classes. So that makes them MapReduce-internal. (But these classes are not yet annotated with our API classifications.)

          Doug Cutting added a comment -

          > Is all this worth it just to avoid stack dumps in aborted speculative task attempts?

          Probably so, although I've not looked at the details. A task timing out is a bad thing. A speculative task getting killed is normal. One should be able to easily scan a job's log for bad things.

          These are all internal APIs, right? If so, it shouldn't create problems to change them, should it?

          Aaron Kimball added a comment -

          Actually, I suppose that if it comes from the JT, then it's definitely a speculative task attempt, right? Task attempt timeouts are actually between the attempt and the TT, and the JT isn't involved at all.

          In the event of a timeout, markUnresponsiveTasks() calls TaskTracker.purgeTask(tip, wasFailure=true), which calls tip.jobHasFinished(wasFailure), which calls tip.kill(wasFailure).

          Unfortunately, here is where the trail of failure/non-failure data about why the task should be killed ends. This calls TaskRunner.kill(), which calls JvmManager.taskKilled(this), which calls JvmManagerForType.taskKilled(taskRunner), which calls JvmManagerForType.killJvm(jvmId), which calls JvmRunner.kill(), which calls TaskController.destroyTaskJvm(TaskControllerContext). (Someone please correct me if I'm wrong.)

          But TaskRunner.kill() doesn't get a reason code like wasFailure. This could be changed, but then we'd also need to modify JvmManager, and add a synchronized/volatile call to hand off this data into the TaskControllerContext object. Is all this worth it just to avoid stack dumps in aborted speculative task attempts?

          Aaron Kimball added a comment -

          Touche.

          Does the jobtracker send information to the tasktracker, though, informing it of whether it is a timeout-based kill or simply a redundant speculative task attempt?

          Doug Cutting added a comment -

          > receiving a SIGKILL implies that something has gone wrong in the process.

          How are speculatively executed tasks terminated?

          Aaron Kimball added a comment -

          To address all three reviews above:

          • I'll rename the constant to SIGQUIT_TASK_JVM; understandability overrides consistency. "CORE_DUMP_JVM" is also a misnomer; if anything, the proper alternative would be "STACK_DUMP_JVM".
          • I'll adapt the logic to go SIGQUIT, SIGKILL instead of SIGQUIT, SIGTERM, SIGKILL. Use the same constant (sleepTimeBeforeSigkill) as already exists.
          • ProcessTree and DefaultTaskController both contain a lot of redundant code clones; if people are okay with it, I'm +1 on condensing a lot of the logic. The initial version of this patch didn't modify anything extraneous, but since the earlier versions of this patch make its intent clear, I'll write a broader one that cleans up the other code "in the neighborhood."
          • I'll rename the current logic in finishTask() to sendSignal(); we can preserve a method named finishTask() that specifically sends SIGKILL using sendSignal()
          This currently causes stack traces for all killed tasks, right? I don't personally have a problem with that, but the description of the JIRA indicates that only those due to failing to report status will dump their stack, and it's worth noting the difference.

          I've tried tracing backwards through the code to figure out what triggers JVM kills, but it's a lengthy string of methods involved. What other operations are there (besides task timeout) which wind up killing the task? The TaskController.destroyTaskJVM() method is only called from JvmManager.kill(), which itself receives no information as to why it's killing the JVM in question. (There are no arguments to this method; there's no flag in there to determine that the jvm under management has experienced a timeout.)

          I claim that actually receiving a SIGKILL implies that something has gone wrong in the process. Tasks that clean up "politely" do not get the SIGKILL, and thus also do not get the SIGQUIT/dump stacks. So if something's gone wrong, then in my mind, we should capture the stack trace for debug purposes.

          Otherwise, we will need to modify the JvmManager API to be more precise about the nature of kills. I suppose I could live with that, but someone else needs to point me to where the information about this lives. So this would add complexity, and in my mind, be less useful.

          • As for sigQuitProcessGroup(), I modeled the logic in dumpTaskStack() after that of all the other signalling methods. It's conceivable that a user's task process hangs because it spawns subprocesses and then gets in a deadlock based on IPC between the task and its subprocesses. I think the better question is: Is there a good reason not to send SIGQUIT to the entire child process group? (e.g., substantial overhead, especially that which blocks the TT?) I don't think there's any more overhead here on the TT's side than sending SIGTERM to the process group, which we already do. If we maintain this same logical structure in dumpTaskStack as we do in killTask, etc, then we can refactor this code down into a considerably more condensed form involving better code sharing. Otherwise, dumpTaskStack will have to remain a special case.
          Hemanth Yamijala added a comment -

          LinuxTaskController.finishTask is now sort of a misnomer, since you're using it to send SIGQUIT. Maybe rename to sendKillSignal or something?

          signalTask? The task-controller was modeled on the lines of launch/kill a JVM, initialize/finish a job, initialize/finish a task, etc. It would be good to maintain those abstractions. I haven't looked at the patch, but can we make signalTask additional to the existing APIs? The implementation could of course reuse code. Thoughts?

          Vinod Kumar Vavilapalli added a comment -

          This currently causes stack traces for all killed tasks, right? I don't personally have a problem with that, but the description of the JIRA indicates that only those due to failing to report status will dump their stack, and it's worth noting the difference.

          We shouldn't be doing this. A dump is needed only for the tasks that the TaskTracker forcefully kills when it thinks the task in question is hung.

          in destroyTaskJVM, you sleep for sleeptimeBeforeSigkill in between the SIGQUIT and the SIGKILL. This seems wrong - it should probably be a different timeout.

          That timeout was added so that the process can do whatever cleanup it wishes to do before a SIGKILL arrives. So I think it's fine to use the same configuration. But I think we should just skip SIGTERM and go directly to SIGKILL after SIGQUIT. We know for sure the task is hung and we are forcibly killing it, so skipping SIGTERM in this case will avoid further waiting before SIGKILL.

          I'd prefer SIGQUIT_TASK_JVM rather than QUIT_TASK_JVM for clarity's sake. It's a little less consistent, but more obvious for people reading the code later on.

          +1. Either that or CORE_DUMP_TASK_JVM. No bias though.

          LinuxTaskController.finishTask is now sort of a misnomer, since you're using it to send SIGQUIT. Maybe rename to sendKillSignal or something?

          +1

          I have one more point. We need a dump of the task JVM itself, and (I think) not of the child processes of the JVM. So no sigQuitProcessGroup()? Thoughts?

          Todd Lipcon added a comment -
          • I'd prefer SIGQUIT_TASK_JVM rather than QUIT_TASK_JVM for clarity's sake. It's a little less consistent, but more obvious for people reading the code later on.
          • in destroyTaskJVM, you sleep for sleeptimeBeforeSigkill in between the SIGQUIT and the SIGKILL. This seems wrong - it should probably be a different timeout.
          • ProcessTree.java: move SIGQUIT constant to top of class
          • ProcessTree.java: some whitespace indentation seems wrong for sigQuitProcess
          • Can we refactor some of the various signal-sending code in ProcessTree to share common code? Lots of very similar methods.
          • This currently causes stack traces for all killed tasks, right? I don't personally have a problem with that, but the description of the JIRA indicates that only those due to failing to report status will dump their stack, and it's worth noting the difference.
          • LinuxTaskController.finishTask is now sort of a misnomer, since you're using it to send SIGQUIT. Maybe rename to sendKillSignal or something?
          Aaron Kimball added a comment -

          Contrib failure is unrelated

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12423061/MAPREDUCE-1119.2.patch
          against trunk revision 828979.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/206/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/206/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/206/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/206/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          New patch, addresses a merge conflict.

          Show
          Aaron Kimball added a comment - New patch, addresses merge conflit.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12423057/MAPREDUCE-1119.patch
          against trunk revision 828979.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/205/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          Attaching a patch which performs this function.

          Stack traces are added to the stdout of the task itself via SIGQUIT; this naturally lets it get collected in the stdout log of the task.
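          As background for the SIGQUIT mechanism described above: HotSpot's SIGQUIT handler writes a full thread dump to the JVM's stdout, and a roughly equivalent dump can be produced in-process with Thread.getAllStackTraces(). A minimal sketch (not the patch's code):

```java
// Illustrative only: an in-process approximation of the thread dump that
// HotSpot prints to stdout when the JVM receives SIGQUIT.
import java.util.Map;

public class StackDump {
    public static String dumpAllStacks() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(e.getKey().getName()).append("\"\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("\tat ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```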

          This patch modifies the API of TaskController to include a dumpTaskStack() method that invokes SIGQUIT.

          In DefaultTaskController, this is actually performed by ProcessTree. The LinuxTaskController will send a new opcode TaskCommands.QUIT_TASK_JVM to the task-controller module; this sends the SIGQUIT signal itself to the client.

          The existing behavior of TaskController.destroyTaskJVM() is to send SIGTERM, sleep for context.sleeptimeBeforeSigkill and then send SIGKILL; I've modified this method so that it goes SIGQUIT/sleep/SIGTERM/sleep/SIGKILL. The sleep is necessary after the SIGQUIT to give the task time to actually do the stack dump before it has to handle SIGTERM.
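          The SIGQUIT/sleep/SIGTERM/sleep/SIGKILL escalation could be sketched as below, assuming a POSIX kill binary on the PATH. The names destroyTaskJvm and sleeptimeBeforeSigkill mirror the discussion, but this is an illustration, not the actual patch:

```java
// Illustrative sketch of the SIGQUIT -> SIGTERM -> SIGKILL escalation
// discussed above; not the actual TaskController code.
import java.io.IOException;

public class TaskKillEscalation {

    // Send a named signal to a single process via the POSIX kill utility.
    static void signal(int pid, String sig)
            throws IOException, InterruptedException {
        new ProcessBuilder("kill", "-" + sig, Integer.toString(pid))
            .start().waitFor();
    }

    // For a JVM target, SIGQUIT triggers a thread dump on its stdout; the
    // sleeps give the target time to react before the next, stronger signal.
    static void destroyTaskJvm(int pid, long sleeptimeBeforeSigkill)
            throws IOException, InterruptedException {
        signal(pid, "QUIT");
        Thread.sleep(sleeptimeBeforeSigkill);
        signal(pid, "TERM");
        Thread.sleep(sleeptimeBeforeSigkill);
        signal(pid, "KILL");
    }
}
```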

          I tested this by running some jobs which time out and verified that they got the stack dumps in their task stdout logs; jobs which succeed do not. I did this with both the DefaultTaskController and the LinuxTaskController. I also added a unit test to the patch which checks that evidence of a stack dump appears in the stdout log for a task which is killed by the unit test.

          While I was in the task-controller C++ module, I discovered a segfault which is also fixed in this patch. If HADOOP_CONF_DIR isn't defined, it expects argv[0] to be the full path to task-controller so it can find the conf dir based on this. If you just run ./task-controller, this will try to malloc a negative amount of space. I changed it to exit gracefully with an error message in this case. (Simple fix; no unit test case.)

          Arun C Murthy added a comment -

          +1 for the direction.

          Hong Tang added a comment -

          +1. Understanding why tasks fail would be very valuable to help both framework devs and users.


            People

            • Assignee: Aaron Kimball
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 14