  1. Hadoop Common
  2. HADOOP-7454 Common side of High Availability Framework (HDFS-1623)
  3. HADOOP-7896

HA: if both NNs are in Standby mode, client needs to try failing back and forth several times with sleeps

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: HA Branch (HDFS-1623)
    • Fix Version/s: HA Branch (HDFS-1623)
    • Component/s: ha, ipc
    • Labels:
      None

      Description

      During a manual failover, both NNs may be in Standby mode for a non-trivial amount of time. Currently, the failover proxy fails over immediately on receiving the standby exception from the first NN, and when it hits the same exception on the second NN, it fails outright. It should probably fail back and forth between the two, sleeping between attempts, nearly indefinitely while both NNs are in Standby mode.
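The behaviour requested above can be sketched roughly as follows. This is only an illustration, not the code in the attached patches: `FailoverLoopSketch`, `StandbyLikeException`, `invokeWithFailover`, and the `maxFailovers` and `sleepMillis` parameters are all hypothetical names, and the real change lives in the retry classes under org.apache.hadoop.io.retry.

```java
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical sketch: alternate between targets while they report standby. */
class FailoverLoopSketch {

  /** Stand-in for the error a standby NN returns; not a Hadoop class. */
  static class StandbyLikeException extends Exception {
    StandbyLikeException(String msg) {
      super(msg);
    }
  }

  /**
   * Calls the targets in round-robin order. On a standby-style failure it
   * sleeps, fails over to the next target, and gives up only after
   * maxFailovers failovers. Any other exception propagates immediately.
   */
  static <T> T invokeWithFailover(List<Callable<T>> targets, int maxFailovers,
      long sleepMillis) throws Exception {
    int current = 0;
    for (int failovers = 0; ; failovers++) {
      try {
        return targets.get(current).call();
      } catch (StandbyLikeException e) {
        if (failovers >= maxFailovers) {
          throw e; // budget exhausted: surface the last standby error
        }
        Thread.sleep(sleepMillis); // fixed delay keeps the sketch short
        current = (current + 1) % targets.size();
      }
    }
  }
}
```

With a large `maxFailovers`, a client keeps bouncing between the two NNs until one of them becomes active, which is the "nearly indefinitely" behaviour described above.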

      1. HADOOP-7896-HDFS-1623.patch
        14 kB
        Aaron T. Myers
      2. HADOOP-7896-HDFS-1623.patch
        17 kB
        Aaron T. Myers
      3. HADOOP-7896-HDFS-1623.patch
        17 kB
        Aaron T. Myers

          Activity

          stevel@apache.org Steve Loughran added a comment -

          sleeps with a bit of random jitter, I hope, possibly even exponential backoff
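As a rough illustration of what exponential backoff with random jitter could look like: the class name, method name, parameters, and the 0.5 to 1.5 jitter range below are assumptions made for this sketch, not the patch's code.

```java
import java.util.Random;

/** Hypothetical helper; names and the jitter range are assumptions. */
class BackoffSketch {
  private static final Random RAND = new Random();

  /**
   * Returns a delay of baseMillis * 2^retries, capped at capMillis, then
   * scaled by a random factor in [0.5, 1.5). Assumes baseMillis is small
   * enough that the multiplication does not overflow a long.
   */
  static long backoffMillis(long baseMillis, int retries, long capMillis) {
    int shift = Math.min(retries, 30); // bound the shift for large retry counts
    long exponential = Math.min(baseMillis * (1L << shift), capMillis);
    double jitter = RAND.nextDouble() + 0.5;
    return (long) (exponential * jitter);
  }
}
```

The jitter spreads out retries from clients that all started failing at the same moment, so they do not hit the NNs again in lockstep.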

          atm Aaron T. Myers added a comment -

          Here's a patch which addresses the issue.

          eli Eli Collins added a comment -
          • In calculateExponentialTime this should be *, otherwise you'll get +/- 0%-150%
            RAND.nextFloat() + 0.5
            
          • In RetryInvocationHandler#invoke I'd pull the sleep loop out to a Util method, e.g. sleepAtLeastIgnoreInterrupts (since it's useful elsewhere and we may sleep 2x delayMillis)
          • testFailoverBetweenMultipleStandbys needs a javadoc

          Otherwise looks great.
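One way such a sleep helper might look is sketched below. This is a hypothetical version, not the method committed with the patch: the class name is invented, and re-asserting the interrupt flag on exit is a design choice of this sketch.

```java
/** Hypothetical sketch of a sleep helper; not the code committed in the patch. */
class SleepSketch {

  /**
   * Sleeps for at least millis, resuming after any interrupt. An interrupt is
   * remembered and the thread's interrupt flag is restored before returning.
   */
  static void sleepAtLeastIgnoreInterrupts(long millis) {
    long deadline = System.nanoTime() + millis * 1_000_000L;
    boolean interrupted = false;
    long remainingNanos;
    while ((remainingNanos = deadline - System.nanoTime()) > 0) {
      try {
        // Round up so a sub-millisecond remainder still sleeps.
        Thread.sleep((remainingNanos + 999_999L) / 1_000_000L);
      } catch (InterruptedException e) {
        interrupted = true; // keep sleeping out the remaining time
      }
    }
    if (interrupted) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Because the loop sleeps only for the time remaining until the deadline, an interrupt does not restart the full delay, so the total sleep stays close to the requested duration.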

          atm Aaron T. Myers added a comment -

          Thanks a lot for the review, Eli. Here's an updated patch which addresses your comments.

          tlipcon Todd Lipcon added a comment -

          +1, looks good to me.

          eli Eli Collins added a comment -

          I don't think you need to synchronize the sleep in RetryInvocationHandler#invoke, otherwise looks great.

          atm Aaron T. Myers added a comment -

          Thanks again for the review, Eli. I agree that that synchronization is unnecessary. Here's an updated patch which removes that sync.

          I'm going to commit this momentarily unless there are further objections.

          eli Eli Collins added a comment -

          +1 lgtm

          atm Aaron T. Myers added a comment -

          Thanks a lot for the reviews, Eli and Todd. I've just committed this.

          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #16 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/16/)
          HADOOP-7896. HA: if both NNs are in Standby mode, client needs to try failing back and forth several times with sleeps. Contributed by Aaron T. Myers

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214076
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicy.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ThreadUtil.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/retry/TestFailoverProxy.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/retry/UnreliableImplementation.java
          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #92 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/92/)
          HADOOP-8116. RetriableCommand is using RetryPolicy incorrectly after HADOOP-7896. Contributed by Aaron T. Myers. (Revision 1294729)

          Result = UNSTABLE
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294729
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/RetriableCommand.java
          • /hadoop/common/branches/HDFS-1623/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java

            People

            • Assignee:
              atm Aaron T. Myers
              Reporter:
              tlipcon Todd Lipcon