Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-196

Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha, 3.0.0-alpha1
    • Fix Version/s: 2.1.0-beta
    • Component/s: nodemanager
    • Labels:
      None

      Description

      If NM is started before starting the RM ,NM is shutting down with the following error

      ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting services org.apache.hadoop.yarn.server.nodemanager.NodeManager
      org.apache.avro.AvroRuntimeException: java.lang.reflect.UndeclaredThrowableException
      	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149)
      	at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
      	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167)
      	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242)
      Caused by: java.lang.reflect.UndeclaredThrowableException
      	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66)
      	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182)
      	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145)
      	... 3 more
      Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
      	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131)
      	at $Proxy23.registerNodeManager(Unknown Source)
      	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
      	... 5 more
      Caused by: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
      	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1141)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1100)
      	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128)
      	... 7 more
      Caused by: java.net.ConnectException: Connection refused
      	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
      	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
      	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:659)
      	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
      	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
      	at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
      	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1117)
      	... 9 more
      2012-01-16 15:04:13,336 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
      java.lang.InterruptedException
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
      	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
      	at java.lang.Thread.run(Thread.java:619)
      2012-01-16 15:04:13,337 INFO org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
      2012-01-16 15:04:13,392 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:9999
      2012-01-16 15:04:13,493 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
      2012-01-16 15:04:13,493 INFO org.apache.hadoop.ipc.Server: Stopping server on 24290
      2012-01-16 15:04:13,494 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 24290
      2012-01-16 15:04:13,495 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
      2012-01-16 15:04:13,496 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler is stopped.
      2012-01-16 15:04:13,496 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
      java.lang.InterruptedException
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
      	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
      	at java.lang.Thread.run(Thread.java:619)
      
      1. MAPREDUCE-3676.patch
        1 kB
        B Anil Kumar
      2. YARN-196.1.patch
        3 kB
        Xuan Gong
      3. YARN-196.2.patch
        3 kB
        Xuan Gong
      4. YARN-196.3.patch
        9 kB
        Xuan Gong
      5. YARN-196.4.patch
        11 kB
        Xuan Gong
      6. YARN-196.5.patch
        11 kB
        Xuan Gong
      7. YARN-196.6.patch
        11 kB
        Xuan Gong
      8. YARN-196.7.patch
        11 kB
        Xuan Gong
      9. YARN-196.8.patch
        11 kB
        Xuan Gong
      10. YARN-196.9.patch
        12 kB
        Xuan Gong
      11. YARN-196.10.patch
        12 kB
        Xuan Gong
      12. YARN-196.11.patch
        12 kB
        Xuan Gong
      13. YARN-196.12.patch
        12 kB
        Xuan Gong
      14. YARN-196.12.1.patch
        12 kB
        Hitesh Shah

        Issue Links

          Activity

          Hide
          bikassaha Bikas Saha added a comment -

          here is a finally block which will make the code sleeping for longer than necessary before exiting. this becomes important because admins might kill the NM after waiting for a few seconds for it to exit. In that much time NM has to do a bunch of clean up tasks and this extra sleep does not help.

          Show
          bikassaha Bikas Saha added a comment - here is a finally block which will make the code sleeping for longer than necessary before exiting. this becomes important because admins might kill the NM after waiting for a few seconds for it to exit. In that much time NM has to do a bunch of clean up tasks and this extra sleep does not help.
          Hide
          hudson Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1374 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1374/)
          YARN-196. Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started. Contributed by Xuan Gong. (Revision 1457038)

          Result = SUCCESS
          hitesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1457038
          Files :

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          hudson Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1374 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1374/ ) YARN-196 . Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started. Contributed by Xuan Gong. (Revision 1457038) Result = SUCCESS hitesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1457038 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1346 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1346/)
          YARN-196. Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started. Contributed by Xuan Gong. (Revision 1457038)

          Result = FAILURE
          hitesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1457038
          Files :

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          hudson Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1346 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1346/ ) YARN-196 . Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started. Contributed by Xuan Gong. (Revision 1457038) Result = FAILURE hitesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1457038 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          hudson Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #157 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/157/)
          YARN-196. Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started. Contributed by Xuan Gong. (Revision 1457038)

          Result = SUCCESS
          hitesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1457038
          Files :

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          hudson Hudson added a comment - Integrated in Hadoop-Yarn-trunk #157 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/157/ ) YARN-196 . Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started. Contributed by Xuan Gong. (Revision 1457038) Result = SUCCESS hitesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1457038 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          hudson Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3479 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3479/)
          YARN-196. Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started. Contributed by Xuan Gong. (Revision 1457038)

          Result = SUCCESS
          hitesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1457038
          Files :

          • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
          • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Show
          hudson Hudson added a comment - Integrated in Hadoop-trunk-Commit #3479 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3479/ ) YARN-196 . Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started. Contributed by Xuan Gong. (Revision 1457038) Result = SUCCESS hitesh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1457038 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
          Hide
          hitesh Hitesh Shah added a comment -

          Committed to branch-2 and trunk. Thanks Xuan for addressing the numerous review comments and being so patient. I have also filed a related jira regarding similar handling of connection loss after the NM is up.

          Show
          hitesh Hitesh Shah added a comment - Committed to branch-2 and trunk. Thanks Xuan for addressing the numerous review comments and being so patient. I have also filed a related jira regarding similar handling of connection loss after the NM is up.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12573764/YARN-196.12.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/516//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/516//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12573764/YARN-196.12.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/516//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/516//console This message is automatically generated.
          Hide
          hitesh Hitesh Shah added a comment -

          @Xuan, latest patch looks good. Addressing a very minor nit and uploading it. Will commit as soon as jenkins does a +1.

          Show
          hitesh Hitesh Shah added a comment - @Xuan, latest patch looks good. Addressing a very minor nit and uploading it. Will commit as soon as jenkins does a +1.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12573196/YARN-196.12.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 eclipse:eclipse. The patch failed to build with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/497//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/497//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12573196/YARN-196.12.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. -1 eclipse:eclipse . The patch failed to build with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/497//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/497//console This message is automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12573120/YARN-196.11.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/495//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/495//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12573120/YARN-196.11.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/495//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/495//console This message is automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12572847/YARN-196.10.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 eclipse:eclipse. The patch failed to build with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/485//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/485//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12572847/YARN-196.10.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. -1 eclipse:eclipse . The patch failed to build with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/485//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/485//console This message is automatically generated.
          Hide
          xgong Xuan Gong added a comment -

          Fix format issue

          Show
          xgong Xuan Gong added a comment - Fix format issue
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12572834/YARN-196.9.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/483//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/483//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12572834/YARN-196.9.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/483//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/483//console This message is automatically generated.
          Hide
          hitesh Hitesh Shah added a comment -

          @Xuan, the previous comment still does not seem to have been addressed. Latest patch has:

          + throw new YarnException("Invalid Configuration. " +
          + "YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS " +
          + "should not be negative.");

          Show
          hitesh Hitesh Shah added a comment - @Xuan, the previous comment still does not seem to have been addressed. Latest patch has: + throw new YarnException("Invalid Configuration. " + + "YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS " + + "should not be negative.");
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12571982/YARN-196.8.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/464//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/464//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12571982/YARN-196.8.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/464//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/464//console This message is automatically generated.
          Hide
          hitesh Hitesh Shah added a comment -

          + throw new YarnException("Invalid Configuration. " +
          + "RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS " +
          + "should not be negative.");

          Should replace "RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS " with YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS as a user will not understand anything if the log/exception has a variable name in it - we should use the property name defined in the configs as that provides a more clear explanation to the user. Likewise, fix the exception thrown later in the code as well as the log messages.

          Show
          hitesh Hitesh Shah added a comment - + throw new YarnException("Invalid Configuration. " + + "RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS " + + "should not be negative."); Should replace "RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS " with YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS as a user will not understand anything if the log/exception has a variable name in it - we should use the property name defined in the configs as that provides a more clear explanation to the user. Likewise, fix the exception thrown later in the code as well as the log messages.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12571927/YARN-196.7.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 eclipse:eclipse. The patch failed to build with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/460//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/460//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12571927/YARN-196.7.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. -1 eclipse:eclipse . The patch failed to build with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/460//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/460//console This message is automatically generated.
          Hide
          xgong Xuan Gong added a comment -

          Uploading the new patch to validate the configuration values.

          Show
          xgong Xuan Gong added a comment - Uploading the new patch to validate the configuration values.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12571485/YARN-196.6.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/448//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/448//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12571485/YARN-196.6.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/448//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/448//console This message is automatically generated.
          Hide
          hitesh Hitesh Shah added a comment -

          Also, description in yarn-default.xml should mention the value is specified in seconds.

          Show
          hitesh Hitesh Shah added a comment - Also, description in yarn-default.xml should mention the value is specified in seconds.
          Hide
          hitesh Hitesh Shah added a comment -

          + public static final int DEFAULT_RESOURCEMANAGER_CONNECT_WAIT_SECS =
          + 15*60*1000;
          + public static final long DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS =
          + 30*1000;

          Variable says seconds but value is still in milliseconds?
          Likewise for yarn-default.xml

          + long rmConnectWaitMS =
          + conf.getInt(
          + YarnConfiguration.RESOURCEMANAGER_CONNECT_WAIT_SECS,
          + YarnConfiguration.DEFAULT_RESOURCEMANAGER_CONNECT_WAIT_SECS);
          + long rmConnectionRetryIntervalMS =
          + conf.getLong(
          + YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS,
          + YarnConfiguration.DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS);

          Above variables could be set using *1000 to keep code clean. Special handling needed for -1.

          Show
          hitesh Hitesh Shah added a comment - + public static final int DEFAULT_RESOURCEMANAGER_CONNECT_WAIT_SECS = + 15*60*1000; + public static final long DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS = + 30*1000; Variable says seconds but value is still in milliseconds? Likewise for yarn-default.xml + long rmConnectWaitMS = + conf.getInt( + YarnConfiguration.RESOURCEMANAGER_CONNECT_WAIT_SECS, + YarnConfiguration.DEFAULT_RESOURCEMANAGER_CONNECT_WAIT_SECS); + long rmConnectionRetryIntervalMS = + conf.getLong( + YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS, + YarnConfiguration.DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_SECS); Above variables could be set using *1000 to keep code clean. Special handling needed for -1.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12570598/YARN-196.5.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/423//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/423//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12570598/YARN-196.5.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/423//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/423//console This message is automatically generated.
          Hide
          hitesh Hitesh Shah added a comment -

          Patch still has trailing whitespace issues.

          In YarnConfiguration.java:

          + /** Max time to wait for ResourceManger start
          + */

          Please rephrase to "max time to wait to establish a connection to the ResourceManager when the NodeManager starts"

          + /** Time interval for each NM attempt to connect RM
          + */

          Rephrase to "Time interval between each NM attempt to connect to the ResourceManager"

          You can use the same descriptions in yarn-default.xml. Not sure if "After that period of time, NM will throw out exceptions" is valid in yarn-default.xml. A better description could mention that the NM will shutdown if it cannot connect to the RM within the specified max time period.
          Description should also mention how to use -1 to retry forever.

          Earlier comment had a point of switching to using SECONDS instead of MS for users to understand more easily.

          + // this.hostName = InetAddress.getLocalHost().getCanonicalHostName();

          • Please remove commented out code if not being used.

          Unit test does not really seem to be testing the flow of RESOURCEMANAGER_CONNECT_WAIT_MS being set to -1. waitForEver is being explicitly set to true/false based on the updater's ctor and not really based on the config value. If that flow cannot be tested, it might be better to remove the additional complexity from the test.

          Also, patch will need to be updated due to https://issues.apache.org/jira/browse/HADOOP-9112.

          Show
          hitesh Hitesh Shah added a comment - Patch still has trailing whitespace issues. In YarnConfiguration.java: + /** Max time to wait for ResourceManger start + */ Please rephrase to "max time to wait to establish a connection to the ResourceManager when the NodeManager starts" + /** Time interval for each NM attempt to connect RM + */ Rephrase to "Time interval between each NM attempt to connect to the ResourceManager" You can use the same descriptions in yarn-default.xml. Not sure if "After that period of time, NM will throw out exceptions" is valid in yarn-default.xml. A better description could mention that the NM will shutdown if it cannot connect to the RM within the specified max time period. Description should also mention how to use -1 to retry forever. Earlier comment had a point of switching to using SECONDS instead of MS for users to understand more easily. + // this.hostName = InetAddress.getLocalHost().getCanonicalHostName(); Please remove commented out code if not being used. Unit test does not really seem to be testing the flow of RESOURCEMANAGER_CONNECT_WAIT_MS being set to -1. waitForEver is being explicitly set to true/false based on the updater's ctor and not really based on the config value. If that flow cannot be tested, it might be better to remove the additional complexity from the test. Also, patch will need to be updated due to https://issues.apache.org/jira/browse/HADOOP-9112 .
          Hide
          hitesh Hitesh Shah added a comment -

          @Xuan, the comment about the RM dying was for a later point in time. i.e. NM starts, connects to RM, does its work and then say 30 mins later, the RM dies.

          Thanks for the quick turnaround on the patch comments.

          Show
          hitesh Hitesh Shah added a comment - @Xuan, the comment about the RM dying was for a later point in time. i.e. NM starts, connects to RM, does its work and then say 30 mins later, the RM dies. Thanks for the quick turnaround on the patch comments.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12569621/YARN-196.4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/411//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/411//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12569621/YARN-196.4.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/411//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/411//console This message is automatically generated.
          Hide
          xgong Xuan Gong added a comment -

          bq: Also, we should check as to what the NM does if the RM dies at an intermediate point i.e registration succeeded but eventually at some point the RM died. Does the NM keep re-trying at this point or die after a certain time interval or after all containers finish? That bit could be looked at in a separate jira if you find any issues in that aspect.

          I do not think so. This block only checks:
          this.resourceTracker = getRMClient();//find resourceTracker
          regResponse = this.resourceTracker.registerNodeManager(request).getRegistrationResponse();//register the NM and get response.
          If those are getting correctly(without exceptions), we will get out of the whole loop. Even if RM will die later after we successfully get the NM registration and response, I do not think the NM will retrying it. This code is only for NM register with RM.

          Show
          xgong Xuan Gong added a comment - bq: Also, we should check as to what the NM does if the RM dies at an intermediate point i.e registration succeeded but eventually at some point the RM died. Does the NM keep re-trying at this point or die after a certain time interval or after all containers finish? That bit could be looked at in a separate jira if you find any issues in that aspect. I do not think so. This block only checks: this.resourceTracker = getRMClient();//find resourceTracker regResponse = this.resourceTracker.registerNodeManager(request).getRegistrationResponse();//register the NM and get response. If those are getting correctly(without exceptions), we will get out of the whole loop. Even if RM will die later after we successfully get the NM registration and response, I do not think the NM will retrying it. This code is only for NM register with RM.
          Hide
          hitesh Hitesh Shah added a comment -

          Fix trailing space issues in the patch. There also seem to be tabs in the patch.

          Some comments on the test code:

          • indentation issues in MyNodeStatusUpdater4
          • change code layout to put MyNodeStatusUpdater4 after MyNodeStatusUpdater3
          • fix MyNodeStatusUpdater4 to make waitIntervalMS a final field and passed in via the ctor.
          • likewise fo waitForEver
          • MyNodeStatusUpdater4::getRMClient does not use waitIntervalMS
          • could reuse waitIntervalMS var ( and reset to 20 seconds later ) so as to not rely on 30*1000 in the first check.
          • also, please look at reducing the time values so that the test run does not take too long. Current unit test takes over a minute to complete.

          Use "resourcemanager.connect.wait.ms" for NM_RESOURCEMANAGER_WAIT_MS to be more clear.
          Likewise for NM_RESOURCEMANAGER_RETRY_INTERVAL_MS, use "resourcemanager.connect.retry_interval.ms"
          Add both these new variables into yarn-default.xml with clear description.

          + LOG.info("Connecting to ResourceManager at " + this.rmAddress);

          • should be also log current attempt counter.

          + RegisterNodeManagerRequest request = recordFactory.newRecordInstance(RegisterNodeManagerRequest.class);
          + request.setHttpPort(this.httpPort);
          + request.setResource(this.totalResource);
          + request.setNodeId(this.nodeId);

          • this object does not need to be created for each loop. Could be created once outside.

          Also, we should check as to what the NM does if the RM dies at an intermediate point i.e registration succeeded but eventually at some point the RM died. Does the NM keep re-trying at this point or die after a certain time interval or after all containers finish? That bit could be looked at in a separate jira if you find any issues in that aspect.

          Show
          hitesh Hitesh Shah added a comment - Fix trailing space issues in the patch. There also seem to be tabs in the patch. Some comments on the test code: indentation issues in MyNodeStatusUpdater4 change code layout to put MyNodeStatusUpdater4 after MyNodeStatusUpdater3 fix MyNodeStatusUpdater4 to make waitIntervalMS a final field and passed in via the ctor. likewise fo waitForEver MyNodeStatusUpdater4::getRMClient does not use waitIntervalMS could reuse waitIntervalMS var ( and reset to 20 seconds later ) so as to not rely on 30*1000 in the first check. also, please look at reducing the time values so that the test run does not take too long. Current unit test takes over a minute to complete. Use "resourcemanager.connect.wait.ms" for NM_RESOURCEMANAGER_WAIT_MS to be more clear. Likewise for NM_RESOURCEMANAGER_RETRY_INTERVAL_MS, use "resourcemanager.connect.retry_interval.ms" Add both these new variables into yarn-default.xml with clear description. + LOG.info("Connecting to ResourceManager at " + this.rmAddress); should be also log current attempt counter. + RegisterNodeManagerRequest request = recordFactory.newRecordInstance(RegisterNodeManagerRequest.class); + request.setHttpPort(this.httpPort); + request.setResource(this.totalResource); + request.setNodeId(this.nodeId); this object does not need to be created for each loop. Could be created once outside. Also, we should check as to what the NM does if the RM dies at an intermediate point i.e registration succeeded but eventually at some point the RM died. Does the NM keep re-trying at this point or die after a certain time interval or after all containers finish? That bit could be looked at in a separate jira if you find any issues in that aspect.
          Hide
          hitesh Hitesh Shah added a comment -

          Sorry - ran through the patch too quickly. Xuan, please ignore my comment on addressing the moving of code to the
          registerWithRM function.

          Show
          hitesh Hitesh Shah added a comment - Sorry - ran through the patch too quickly. Xuan, please ignore my comment on addressing the moving of code to the registerWithRM function.
          Hide
          hitesh Hitesh Shah added a comment -

          More comments:

          • Please restrict lines to 80 chars.
          • boolean waitForEver = (rmWaitMS==-1)? ... could be re-written as "boolean waitForEver = (rmWaitMS == -1);"
          • we could change WAIT_MS and RETRY_INTERVAL_MS to WAIT_SECS and RETRY_PERIOD_SECS as the normal scale used is seconds?

          the current patch seems to be catching all exceptions. This will cause a problem in the case where the RM explicitly asks the NM to shutdown - maybe it makes sense to move the retry logic into the registerWithRM function?

          The above comment for patch-2 does not seem to have been addressed in the latest patch.

          +          String errorMessage = "\nFailed to Connect to RM, no. of failed attemps is "+rmRetryCount;
          +          LOG.error(e+errorMessage);
          +          throw new YarnException(e+errorMessage);
          
          • Please do not use \n. It is not platform independent.
          • You can use the LOG.error(message, throwable) api.
          • Likewise you could use YarnException(String message, Throwable cause)
          Show
          hitesh Hitesh Shah added a comment - More comments: Please restrict lines to 80 chars. boolean waitForEver = (rmWaitMS==-1)? ... could be re-written as "boolean waitForEver = (rmWaitMS == -1);" we could change WAIT_MS and RETRY_INTERVAL_MS to WAIT_SECS and RETRY_PERIOD_SECS as the normal scale used is seconds? the current patch seems to be catching all exceptions. This will cause a problem in the case where the RM explicitly asks the NM to shutdown - maybe it makes sense to move the retry logic into the registerWithRM function? The above comment for patch-2 does not seem to have been addressed in the latest patch. + String errorMessage = "\nFailed to Connect to RM, no. of failed attemps is " +rmRetryCount; + LOG.error(e+errorMessage); + throw new YarnException(e+errorMessage); Please do not use \n. It is not platform independent. You can use the LOG.error(message, throwable) api. Likewise you could use YarnException(String message, Throwable cause)
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12568891/YARN-196.3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/401//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/401//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12568891/YARN-196.3.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/401//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/401//console This message is automatically generated.
          Hide
          hitesh Hitesh Shah added a comment -

          Xuan, some comments:

          • testNMShutdownForRegistrationFailure tests for an explicit command from the RM telling the NM to shut down.
          • failing within 10 seconds seems too quick. The rpc layer internally retries for a certain time period. From the NM layer, we should probably have a total time length defined - say 15 mins and retry after 30 seconds or so within that time period.
          • also someone should be able to set the time period to -1 to disable the upper bound and retry forever if needed.
          • use same conventions as used elsewhere when naming variables - rm_Retry_interval_ms does not confirm to the standards defined in the class.
          • "LOG.debug("Fail to connect to RM");" - change to error and log the exception stack trace unless it is being caught elsewhere and being printed. It would also help to log how many retries were attempted before failing out.
          • in the start() function, there is an AvroRuntimeException being thrown - we should replace that with YarnException or an appropriate runtime exception.
          • isRMStarted var is not needed - a simple break in the loop if the registration is done should suffice.
          • please remove the space in "rm_Retry_Count --;"
          • the debug log message at the end of the loop should be set to use WARN level. Also, please re-phrase it for more clarity - something along the lines of Retrying connecting to RM, current no. of failed attempts ...
          • the current patch seems to be catching all exceptions. This will cause a problem in the case where the RM explicitly asks the NM to shutdown - maybe it makes sense to move the retry logic into the registerWithRM function?
          Show
          hitesh Hitesh Shah added a comment - Xuan, some comments: testNMShutdownForRegistrationFailure tests for an explicit command from the RM telling the NM to shut down. failing within 10 seconds seems too quick. The rpc layer internally retries for a certain time period. From the NM layer, we should probably have a total time length defined - say 15 mins and retry after 30 seconds or so within that time period. also someone should be able to set the time period to -1 to disable the upper bound and retry forever if needed. use same conventions as used elsewhere when naming variables - rm_Retry_interval_ms does not confirm to the standards defined in the class. "LOG.debug("Fail to connect to RM");" - change to error and log the exception stack trace unless it is being caught elsewhere and being printed. It would also help to log how many retries were attempted before failing out. in the start() function, there is an AvroRuntimeException being thrown - we should replace that with YarnException or an appropriate runtime exception. isRMStarted var is not needed - a simple break in the loop if the registration is done should suffice. please remove the space in "rm_Retry_Count --;" the debug log message at the end of the loop should be set to use WARN level. Also, please re-phrase it for more clarity - something along the lines of Retrying connecting to RM, current no. of failed attempts ... the current patch seems to be catching all exceptions. This will cause a problem in the case where the RM explicitly asks the NM to shutdown - maybe it makes sense to move the retry logic into the registerWithRM function?
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12568531/YARN-196.2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/397//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/397//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12568531/YARN-196.2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/397//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/397//console This message is automatically generated.
          Hide
          xgong Xuan Gong added a comment -

          Fix the test failure.
          testNMShutdownForRegistrationFailure will cover this update

          Show
          xgong Xuan Gong added a comment - Fix the test failure. testNMShutdownForRegistrationFailure will cover this update
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12568507/YARN-196.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

          org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/396//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/396//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12568507/YARN-196.1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/396//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/396//console This message is automatically generated.
          Hide
          subrotosanyal Subroto Sanyal added a comment -

          Met with the issue today while booting up cdh4b2 in our cluster ....

          Show
          subrotosanyal Subroto Sanyal added a comment - Met with the issue today while booting up cdh4b2 in our cluster ....
          Hide
          akumarb2010 B Anil Kumar added a comment -

          Thanks Arun for your input. Is it ok to introduce configuration parameters with default values like

          Configuration Parameter Default Value
          NM_PREFIX + resourcemanager.retry.count 10
          NM_PREFIX + resourcemanager.retry.interval.ms 1000
          Show
          akumarb2010 B Anil Kumar added a comment - Thanks Arun for your input. Is it ok to introduce configuration parameters with default values like Configuration Parameter Default Value NM_PREFIX + resourcemanager.retry.count 10 NM_PREFIX + resourcemanager.retry.interval.ms 1000
          Hide
          acmurthy Arun C Murthy added a comment -

          Re-trying for a certain number of times before giving up seems reasonable.

          Show
          acmurthy Arun C Murthy added a comment - Re-trying for a certain number of times before giving up seems reasonable.
          Hide
          akumarb2010 B Anil Kumar added a comment -

          Currently, If NM starts before RM, then NM gets shutdown due to registration failure. But I think, an NM can start before RM, so instead of lettting NM down, it would be better to retry RM, till it is up.

          The same has been provided as patch. Please let me know, whether this approach is OK or not.

          And moreover second case, retrying RM continuously, after RM and NM started and RM's unavailabilty, seems
          correct.

          Show
          akumarb2010 B Anil Kumar added a comment - Currently, If NM starts before RM, then NM gets shutdown due to registration failure. But I think , an NM can start before RM, so instead of lettting NM down, it would be better to retry RM, till it is up. The same has been provided as patch. Please let me know, whether this approach is OK or not. And moreover second case, retrying RM continuously, after RM and NM started and RM's unavailabilty, seems correct.
          Hide
          akumarb2010 B Anil Kumar added a comment -

          When NM is started before RM, I think NM gets YarnRemoteException (due to network unreachability) while trying to register with RM.

          However when RM goes down, NM tries forever. I think, NM should retry only yarn.nm.liveness-monitor.expiry-interval-ms amount of time, and then it should get shudown, as there is no use of retrying after this amount time.

          any comments?

          Show
          akumarb2010 B Anil Kumar added a comment - When NM is started before RM, I think NM gets YarnRemoteException (due to network unreachability) while trying to register with RM. However when RM goes down, NM tries forever. I think, NM should retry only yarn.nm.liveness-monitor.expiry-interval-ms amount of time, and then it should get shudown, as there is no use of retrying after this amount time. any comments?

            People

            • Assignee:
              xgong Xuan Gong
              Reporter:
              ramgopalnaali Ramgopal N
            • Votes:
              1 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development