Hadoop Common / HADOOP-8236

haadmin should have configurable timeouts for failover commands

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.3
    • Fix Version/s: 2.0.0-alpha
    • Component/s: ha
    • Labels:
      None

      Description

      The HAAdmin failover code should time out reasonably aggressively and move on to the fencing strategies if it's dealing with a mostly dead active namenode. Currently it uses what's probably the default, which is to say no timeout whatsoever.

        /**
         * Return a proxy to the specified target service.
         */
        protected HAServiceProtocol getProtocol(String serviceId)
            throws IOException {
          String serviceAddr = getServiceAddr(serviceId);
          InetSocketAddress addr = NetUtils.createSocketAddr(serviceAddr);
          return (HAServiceProtocol)RPC.getProxy(
                HAServiceProtocol.class, HAServiceProtocol.versionID,
                addr, getConf());
        }
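
      To illustrate why the missing timeout matters, here is a minimal sketch of bounding a blocking call so that the caller can give up and move on. It uses only java.util.concurrent and is not the Hadoop RPC API; the class BoundedCall and the method completesWithin are hypothetical names introduced for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Hadoop: bounds a blocking call with a timeout. */
final class BoundedCall {

  /**
   * Runs the call on a worker thread and waits at most timeoutMs for it.
   * Returns true if the call finished in time, false if it timed out.
   */
  static boolean completesWithin(Callable<?> call, long timeoutMs)
      throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-call");
      t.setDaemon(true); // a hung call must not keep the JVM alive
      return t;
    });
    Future<?> future = executor.submit(call);
    try {
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the worker; the caller moves on
      return false;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulates an unresponsive peer: the call would block for 10 seconds.
    Callable<Void> hung = () -> { Thread.sleep(10_000); return null; };
    System.out.println("hung call finished in time: " + completesWithin(hung, 200));

    // Simulates a responsive peer that answers immediately.
    Callable<String> healthy = () -> "standby";
    System.out.println("healthy call finished in time: " + completesWithin(healthy, 200));
  }
}
```

      Without the bound, the first call in main would hold the caller for the full ten seconds; a peer that never answers would hold it indefinitely.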
      
      1. hadoop-8236.txt
        20 kB
        Todd Lipcon
      2. hadoop-8236.txt
        20 kB
        Todd Lipcon

          Activity

          Todd Lipcon added a comment -

          Attached patch adds the following timeouts:

          key                                                    default  purpose
          ha.failover-controller.new-active.rpc-timeout.ms       60s      timeout for asking the new active to become active
          ha.failover-controller.graceful-fence.rpc-timeout.ms   5s       timeout for asking the old active to become standby, before fencing
          ha.failover-controller.cli-check.rpc-timeout.ms        20s      timeout for CLI commands like -getServiceState, -monitorHealth
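
          As a rough sketch of how the three keys and their defaults could be looked up, the example below uses java.util.Properties as a stand-in for Hadoop's Configuration class. The class FailoverTimeouts and its field names are hypothetical. The key names and defaults come from the table above, and the values are assumed to be stored in milliseconds, as the ".ms" suffix suggests.

```java
import java.util.Properties;

/** Hypothetical holder for the three timeouts; not the Hadoop Configuration API. */
final class FailoverTimeouts {
  static final String NEW_ACTIVE_KEY =
      "ha.failover-controller.new-active.rpc-timeout.ms";
  static final String GRACEFUL_FENCE_KEY =
      "ha.failover-controller.graceful-fence.rpc-timeout.ms";
  static final String CLI_CHECK_KEY =
      "ha.failover-controller.cli-check.rpc-timeout.ms";

  // Defaults from the table (60s, 5s, 20s), expressed in milliseconds.
  static final int NEW_ACTIVE_DEFAULT_MS = 60_000;
  static final int GRACEFUL_FENCE_DEFAULT_MS = 5_000;
  static final int CLI_CHECK_DEFAULT_MS = 20_000;

  final int newActiveMs;
  final int gracefulFenceMs;
  final int cliCheckMs;

  FailoverTimeouts(Properties conf) {
    newActiveMs = getInt(conf, NEW_ACTIVE_KEY, NEW_ACTIVE_DEFAULT_MS);
    gracefulFenceMs = getInt(conf, GRACEFUL_FENCE_KEY, GRACEFUL_FENCE_DEFAULT_MS);
    cliCheckMs = getInt(conf, CLI_CHECK_KEY, CLI_CHECK_DEFAULT_MS);
  }

  /** Returns the configured value, or the default when the key is unset. */
  static int getInt(Properties conf, String key, int defaultMs) {
    String raw = conf.getProperty(key);
    return raw == null ? defaultMs : Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Override only the graceful-fence timeout; the other two keep their defaults.
    conf.setProperty(GRACEFUL_FENCE_KEY, "2000");
    FailoverTimeouts t = new FailoverTimeouts(conf);
    System.out.println("new-active=" + t.newActiveMs
        + " graceful-fence=" + t.gracefulFenceMs
        + " cli-check=" + t.cliCheckMs);
  }
}
```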

          Philip, do you think these seem reasonable?

          I tested this manually by kill -STOPping my namenode and then initiating a failover. The RPC timed out after 5s and then fenced the stopped NN before doing the failover.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12520711/hadoop-8236.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.fs.viewfs.TestViewFsTrash

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/807//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/807//console

          This message is automatically generated.

          Eli Collins added a comment -

          Todd,
          These timeouts look reasonable to me. Worth noting that new-active is also the timeout for the active pre-check, i.e. the check that the new active is alive and well before we ask the current active to go standby. This is important because we don't want to impatiently wait 5s before fencing and then wait a minute to make the new active active. In practice, since we have already contacted the new active, we probably won't have to wait 60s to transition it to active unless something happened between the pre-check and the transition to active, which is why a 60s timeout here is reasonable.

          Nit: can remove the "TODO" before transitionToActive since this is now configurable.
          Otherwise patch looks great.
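
          The ordering described in this comment can be sketched as follows. This is an illustration under assumptions, not the actual FailoverController code: the Node and Fencer interfaces, the ScriptedNode class, and the node names nn1 and nn2 are hypothetical, and timeouts are simulated by throwing TimeoutException rather than by real RPCs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of the failover ordering; not the Hadoop implementation. */
final class FailoverSketch {

  /** Stand-in for an HA service; every call carries its own timeout. */
  interface Node {
    void checkHealthy(int timeoutMs) throws TimeoutException;
    void transitionToStandby(int timeoutMs) throws TimeoutException;
    void transitionToActive(int timeoutMs) throws TimeoutException;
  }

  /** Stand-in for a fencing strategy; returns true if fencing succeeded. */
  interface Fencer {
    boolean fence(Node target);
  }

  static void failover(Node oldActive, Node newActive, Fencer fencer,
      int newActiveTimeoutMs, int gracefulFenceTimeoutMs) throws TimeoutException {
    // 1. Pre-check the new active first, using the longer new-active timeout.
    newActive.checkHealthy(newActiveTimeoutMs);
    try {
      // 2. Ask the old active to step down, waiting only the short timeout.
      oldActive.transitionToStandby(gracefulFenceTimeoutMs);
    } catch (TimeoutException e) {
      // 3. The old active did not answer in time, so fence it.
      if (!fencer.fence(oldActive)) {
        throw new IllegalStateException("fencing failed; refusing to fail over");
      }
    }
    // 4. Only now make the new active active.
    newActive.transitionToActive(newActiveTimeoutMs);
  }

  /** Test double that records each call and, if hung, times out on every call. */
  static final class ScriptedNode implements Node {
    private final String name;
    private final boolean hung;
    private final List<String> log;

    ScriptedNode(String name, boolean hung, List<String> log) {
      this.name = name;
      this.hung = hung;
      this.log = log;
    }

    private void call(String op, int timeoutMs) throws TimeoutException {
      log.add(name + "." + op + " timeout=" + timeoutMs + "ms");
      if (hung) {
        throw new TimeoutException(name + " did not answer " + op);
      }
    }

    @Override public void checkHealthy(int t) throws TimeoutException { call("checkHealthy", t); }
    @Override public void transitionToStandby(int t) throws TimeoutException { call("transitionToStandby", t); }
    @Override public void transitionToActive(int t) throws TimeoutException { call("transitionToActive", t); }
  }

  /** Runs one failover from a hung nn1 to a healthy nn2 and returns the call log. */
  static List<String> demo() {
    List<String> log = new ArrayList<>();
    Node nn1 = new ScriptedNode("nn1", true, log);
    Node nn2 = new ScriptedNode("nn2", false, log);
    try {
      failover(nn1, nn2, target -> { log.add("fence nn1"); return true; }, 60_000, 5_000);
    } catch (TimeoutException e) {
      log.add("failover aborted: " + e.getMessage());
    }
    return log;
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

          Because the pre-check runs before the old active is asked to step down, an unreachable new active aborts the failover before anything is fenced.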

          Todd Lipcon added a comment -

          New patch removes the TODO. I'll commit this version momentarily.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12520865/hadoop-8236.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.fs.viewfs.TestViewFsTrash

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/810//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/810//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Committed to branch-2 and trunk. Thanks for reviewing.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2043 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2043/)
          HADOOP-8236. haadmin should have configurable timeouts for failover commands. Contributed by Todd Lipcon. (Revision 1308235)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1308235
          Files :

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/FailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestFailoverController.java
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #1968 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1968/)
          HADOOP-8236. haadmin should have configurable timeouts for failover commands. Contributed by Todd Lipcon. (Revision 1308235)

          Result = SUCCESS
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1308235
          Files :

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/FailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestFailoverController.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #1980 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1980/)
          HADOOP-8236. haadmin should have configurable timeouts for failover commands. Contributed by Todd Lipcon. (Revision 1308235)

          Result = ABORTED
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1308235
          Files :

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/FailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestFailoverController.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1003 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1003/)
          HADOOP-8236. haadmin should have configurable timeouts for failover commands. Contributed by Todd Lipcon. (Revision 1308235)

          Result = FAILURE
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1308235
          Files :

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/FailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestFailoverController.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1038 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1038/)
          HADOOP-8236. haadmin should have configurable timeouts for failover commands. Contributed by Todd Lipcon. (Revision 1308235)

          Result = FAILURE
          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1308235
          Files :

          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeys.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/FailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java
          • /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestFailoverController.java

            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Philip Zeyliger
            • Votes:
              0
              Watchers:
              3
