Hadoop Map/Reduce / MAPREDUCE-4797

LocalContainerAllocator can loop forever trying to contact the RM

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.3, 2.0.1-alpha
    • Fix Version/s: 2.0.3-alpha, 0.23.5
    • Component/s: applicationmaster
    • Labels:
      None

      Description

      If LocalContainerAllocator has trouble communicating with the RM, it can end up retrying forever when the error is not a YarnException.

      This can be particularly bad if the connection went down because the cluster was reset such that the RM and NM have lost track of the process, so nothing else will ever kill it. In this scenario, the looping AM continues to pelt the RM with connection requests every second using a stale token, and the RM logs the SASL exceptions over and over.

      1. MAPREDUCE-4797.patch
        6 kB
        Jason Lowe
      2. MAPREDUCE-4797.patch
        6 kB
        Jason Lowe

        Activity

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1258 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1258/)
        MAPREDUCE-4797. LocalContainerAllocator can loop forever trying to contact the RM (jlowe via bobby) (Revision 1409525)

        Result = FAILURE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1409525
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/local/LocalContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local/TestLocalContainerAllocator.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1227 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1227/)
        MAPREDUCE-4797. LocalContainerAllocator can loop forever trying to contact the RM (jlowe via bobby) (Revision 1409525)

        Result = FAILURE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1409525
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/local/LocalContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local/TestLocalContainerAllocator.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #436 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/436/)
        svn merge -c 1409525 FIXES: MAPREDUCE-4797. LocalContainerAllocator can loop forever trying to contact the RM (jlowe via bobby) (Revision 1409532)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1409532
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/local/LocalContainerAllocator.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local/TestLocalContainerAllocator.java
        Hudson added a comment -

        Integrated in Hadoop-Yarn-trunk #37 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/37/)
        MAPREDUCE-4797. LocalContainerAllocator can loop forever trying to contact the RM (jlowe via bobby) (Revision 1409525)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1409525
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/local/LocalContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local/TestLocalContainerAllocator.java
        Hudson added a comment -

        Integrated in Hadoop-trunk-Commit #3019 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3019/)
        MAPREDUCE-4797. LocalContainerAllocator can loop forever trying to contact the RM (jlowe via bobby) (Revision 1409525)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1409525
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/local/LocalContainerAllocator.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/local/TestLocalContainerAllocator.java
        Robert Joseph Evans added a comment -

        Thanks Jason,

        I put this into trunk, branch-2, and branch-0.23

        Robert Joseph Evans added a comment -

        Looks good +1, I'll check it in.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12553572/MAPREDUCE-4797.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3032//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3032//console

        This message is automatically generated.

        Jason Lowe added a comment -

        Of course, I should have just let the exception bubble up and fail the test directly rather than catching and failing. Updated patch accordingly.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12553564/MAPREDUCE-4797.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3031//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3031//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        The code looks good to me. My only comment is with the test. If a different exception is thrown, that exception is eaten, which is likely to make the error difficult to debug. If you could log it somehow, that would be great.

        Jason Lowe added a comment -

        Patch to fix the try block in the heartbeat method so we catch the exceptions being thrown when trying to contact the RM and can handle them for retry logic properly.

        Jason Lowe added a comment -

        The code looks like it will only try to connect so many times before giving up, but there's a bug in LocalContainerAllocator.heartbeat:

        LocalContainerAllocator.heartbeat
        AllocateResponse allocateResponse = scheduler.allocate(allocateRequest);
        AMResponse response;
        try {
          response = allocateResponse.getAMResponse();
          // Reset retry count if no exception occurred.
          retrystartTime = System.currentTimeMillis();
        } catch (Exception e) {
        

        Note that the try block surrounds only the retrieval of the response after the allocate RPC call, so we're missing the point where the exception is really being thrown, and the retry count logic here never gets to handle it. The exception then bubbles up to the RMCommunicator allocator thread, which simply loops around to try again, forever, if the exception isn't a YarnException.
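
To make the analysis above concrete, here is a minimal, self-contained sketch of a heartbeat that places the RPC call itself inside the try block, so a failed call reaches the retry-window check. It is not the actual patch: `HeartbeatSketch`, `Scheduler`, `FatalAllocatorException`, and `attemptsUntilGiveUp` are hypothetical names invented for illustration, the `String` response stands in for the real response type, and the clock is passed in as a parameter so the behavior is deterministic.

```java
import java.io.IOException;

// Illustrative sketch only; none of these types are the real Hadoop classes.
class HeartbeatSketch {

    /** Hypothetical stand-in for the RM scheduler protocol. */
    interface Scheduler {
        String allocate() throws IOException;
    }

    /** Hypothetical fatal error; plays the role of the exception that stops the allocator loop. */
    static class FatalAllocatorException extends RuntimeException {
        FatalAllocatorException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final Scheduler scheduler;
    private final long retryIntervalMs;
    private long retryStartTime;

    HeartbeatSketch(Scheduler scheduler, long retryIntervalMs, long startTime) {
        this.scheduler = scheduler;
        this.retryIntervalMs = retryIntervalMs;
        this.retryStartTime = startTime;
    }

    /**
     * One heartbeat. Returns the response, or null if the call failed but the
     * retry window is still open. Throws once the window has expired.
     */
    String heartbeat(long now) {
        try {
            // The RPC is inside the try block, so its failures are handled here.
            String response = scheduler.allocate();
            // Reset the retry window only after a successful call.
            retryStartTime = now;
            return response;
        } catch (Exception e) {
            if (now - retryStartTime >= retryIntervalMs) {
                throw new FatalAllocatorException(
                    "Could not contact RM after " + retryIntervalMs + " ms", e);
            }
            return null; // the caller retries on the next heartbeat
        }
    }

    /**
     * Simulates an allocator thread that heartbeats once per second.
     * Returns the number of attempts made before giving up, or -1 if it
     * never gave up within maxHeartbeats.
     */
    static int attemptsUntilGiveUp(Scheduler s, long retryIntervalMs, int maxHeartbeats) {
        HeartbeatSketch h = new HeartbeatSketch(s, retryIntervalMs, 0L);
        long now = 0L;
        for (int attempt = 1; attempt <= maxHeartbeats; attempt++) {
            try {
                h.heartbeat(now);
            } catch (FatalAllocatorException e) {
                return attempt;
            }
            now += 1000L;
        }
        return -1;
    }

    public static void main(String[] args) {
        Scheduler alwaysDown = () -> { throw new IOException("simulated connection failure"); };
        System.out.println(attemptsUntilGiveUp(alwaysDown, 3000L, 100));
    }
}
```

With the call outside the try block, as in the quoted snippet, the `IOException` would skip the window check entirely and the simulated loop would run until `maxHeartbeats`.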


          People

          • Assignee:
            Jason Lowe
            Reporter:
            Jason Lowe
          • Votes:
            0
            Watchers:
            5
