Hadoop Map/Reduce / MAPREDUCE-3738

NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.23.1, 0.24.0
    • Fix Version/s: 0.23.2
    • Component/s: mrv2, nodemanager
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Committed to trunk and branch-0.23. Thanks Jason.
    • Target Version/s:

      Description

      If an AppLogAggregator thread dies unexpectedly (e.g., from an uncaught throwable such as the OutOfMemoryError in the case I saw), the nodemanager hangs during shutdown. The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished. Since the thread was killed off earlier by the uncaught error, the boolean is never set and the NM hangs during shutdown, repeating something like this every second in the log file:

      2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806
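
      The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not the actual AppLogAggregatorImpl code: the class, field, and method names are hypothetical, the OutOfMemoryError is simulated, and the wait is bounded so the example terminates instead of hanging. It assumes the worker catches Exception but not Error, and sets the flag only on the normal path.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AggregatorHangSketch {

    /** Hypothetical stand-in for a log aggregation worker. */
    static class Aggregator implements Runnable {
        final AtomicBoolean finished = new AtomicBoolean(false);
        private final boolean failWithError;

        Aggregator(boolean failWithError) {
            this.failWithError = failWithError;
        }

        @Override
        public void run() {
            try {
                if (failWithError) {
                    // Simulated; an Error is not an Exception.
                    throw new OutOfMemoryError("simulated");
                }
            } catch (Exception e) {
                // Exceptions are handled here, but Errors propagate past this block.
            }
            // Skipped entirely when an Error escapes the try block above.
            finished.set(true);
        }

        /** Bounded version of the wait loop; returns whether the flag was ever set. */
        boolean waitForFinish(int maxPolls, long pollMillis) {
            try {
                for (int i = 0; i < maxPolls && !finished.get(); i++) {
                    Thread.sleep(pollMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return finished.get();
        }
    }

    /** Runs the worker to completion (or death), then polls the flag. */
    static boolean flagSetAfterRun(boolean failWithError) {
        Aggregator aggregator = new Aggregator(failWithError);
        Thread worker = new Thread(aggregator, "aggregator-sketch");
        // Suppress the default stack trace printed when the thread dies.
        worker.setUncaughtExceptionHandler((thread, error) -> { });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return aggregator.waitForFinish(3, 10L);
    }

    public static void main(String[] args) {
        System.out.println("normal run, flag set: " + flagSetAfterRun(false));
        System.out.println("worker died, flag set: " + flagSetAfterRun(true));
    }
}
```

      In the second case the worker thread is gone but the flag is still false, so an unbounded version of the wait loop would never exit.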

      1. livehistdump.txt
        142 kB
        Jason Lowe
      2. MAPREDUCE-3738.patch
        5 kB
        Jason Lowe

          Activity

          Mahadev konar added a comment -

          Jason,
          Is there a bug for OOM? What was the reason for that?

          Jason Lowe added a comment -

          No bug for the OOM yet. Unfortunately, the cluster was redeployed before we could grab a full heap dump. I do have the jmap -histo:live output from one of the nodemanagers but haven't had a chance to go through it yet to see if it helps pinpoint where the leak would be.

          Vinod Kumar Vavilapalli added a comment -

          Originally when I wrote this, I had the same suspicion about the join. But later, I made sure all exceptions were caught and that the boolean gets set in all possible cases. OOM/errors are one thing that didn't occur to me.

          Can you debug why you ran into the OOM? We definitely need to fix that, irrespective of how we want to handle other errors.

          Vinod Kumar Vavilapalli added a comment -

          Comment race. Even the stack trace from the OOM will help.

          Jason Lowe added a comment -

          Attaching the histo:live dump from one of the nodemanagers that had hit the OOM error multiple times in the log aggregation threads before eventually trying to shut down. Unfortunately, I don't have a full heap dump or stack dump from that process.

          Jason Lowe added a comment -

          Patch to ensure we always set the finished boolean in the log aggregation thread.

          On a side note, we haven't seen a recurrence of the OOM condition on the nodemanager, so we haven't been able to track down what caused it.
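
          One common way to guarantee that a completion flag is always set is to set it in a finally block, which runs whether the work returns normally, throws an exception, or dies from an Error. The sketch below illustrates that idea only; it is not the attached patch, and the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class FinallyFlagSketch {

    /** Hypothetical worker that signals completion on every exit path. */
    static class Aggregator implements Runnable {
        final AtomicBoolean finished = new AtomicBoolean(false);
        private final Runnable work;

        Aggregator(Runnable work) {
            this.work = work;
        }

        @Override
        public void run() {
            try {
                work.run();
            } finally {
                // Runs on normal return and when any Throwable propagates.
                finished.set(true);
            }
        }
    }

    /** Runs the work on its own thread and reports whether the flag was set. */
    static boolean finishedAfter(Runnable work) {
        Aggregator aggregator = new Aggregator(work);
        Thread worker = new Thread(aggregator, "aggregator-sketch");
        // Suppress the default stack trace printed when the thread dies.
        worker.setUncaughtExceptionHandler((thread, error) -> { });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return aggregator.finished.get();
    }

    public static void main(String[] args) {
        System.out.println(finishedAfter(() -> { }));
        System.out.println(finishedAfter(() -> {
            throw new OutOfMemoryError("simulated"); // simulated failure
        }));
    }
}
```

          With this shape, a caller waiting on the flag is released even when the worker thread dies, although the underlying error still has to be diagnosed separately.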

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12515804/MAPREDUCE-3738.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1917//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1917//console

          This message is automatically generated.

          Siddharth Seth added a comment -

          +1. Looks good.

          Hudson added a comment -

          Integrated in Hadoop-Common-0.23-Commit #587 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/587/)
          merge MAPREDUCE-3738 from trunk (Revision 1293061)

          Result = SUCCESS
          sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Commit #574 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/574/)
          merge MAPREDUCE-3738 from trunk (Revision 1293061)

          Result = SUCCESS
          sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Commit #589 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/589/)
          merge MAPREDUCE-3738 from trunk (Revision 1293061)

          Result = ABORTED
          sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #178 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/178/)
          merge MAPREDUCE-3738 from trunk (Revision 1293061)

          Result = SUCCESS
          sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Build #206 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/206/)
          merge MAPREDUCE-3738 from trunk (Revision 1293061)

          Result = FAILURE
          sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

            People

            • Assignee: Jason Lowe
            • Reporter: Jason Lowe
            • Votes: 0
            • Watchers: 4
