Hadoop Common
HADOOP-7896 (sub-task of HADOOP-7454: Common side of High Availability Framework (HDFS-1623))

HA: if both NNs are in Standby mode, client needs to try failing back and forth several times with sleeps

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: HA Branch (HDFS-1623)
    • Fix Version/s: HA Branch (HDFS-1623)
    • Component/s: ha, ipc
    • Labels: None

      Description

      During a manual failover, there may be an intermediate state, lasting a non-trivial amount of time, in which both NNs are in Standby mode. Currently, the failover proxy fails over immediately when it receives the standby exception from the first NN, and when it hits the same exception on the second NN, it fails the call immediately. It should probably fail back and forth nearly indefinitely, sleeping between attempts, while both NNs are in Standby mode.
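
The behavior requested above can be sketched roughly as follows. This is only an illustration under stated assumptions: `NameNodeCall`, the nested `StandbyException`, `invokeWithFailover`, `maxFailovers`, and `sleepMillis` are hypothetical names made up for this example, and the code does not use Hadoop's actual retry or failover proxy classes.

```java
import java.util.List;

/** Hypothetical sketch; not the Hadoop failover proxy or its RetryPolicy API. */
class StandbyFailoverSketch {

    /** Stand-in for the exception a NameNode raises while in Standby mode. */
    static class StandbyException extends Exception {
        StandbyException(String message) {
            super(message);
        }
    }

    /** One remote call against one NameNode (assumed shape, for illustration). */
    interface NameNodeCall<T> {
        T call() throws StandbyException;
    }

    /**
     * Tries the targets in turn. On a StandbyException it fails over to the
     * next target and sleeps before retrying, until a call succeeds or
     * maxFailovers failovers have been used up.
     */
    static <T> T invokeWithFailover(List<NameNodeCall<T>> targets,
                                    int maxFailovers,
                                    long sleepMillis)
            throws StandbyException, InterruptedException {
        int current = 0;
        for (int failovers = 0; ; failovers++) {
            try {
                return targets.get(current).call();
            } catch (StandbyException e) {
                if (failovers >= maxFailovers) {
                    throw e; // give up: every attempt hit a standby NN
                }
                current = (current + 1) % targets.size(); // fail over to the other NN
                Thread.sleep(sleepMillis);                // wait before retrying
            }
        }
    }
}
```

With a large `maxFailovers`, a client in this sketch keeps alternating between the two NNs until one of them leaves Standby mode, instead of failing on the second standby response.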

      1. HADOOP-7896-HDFS-1623.patch
        14 kB
        Aaron T. Myers
      2. HADOOP-7896-HDFS-1623.patch
        17 kB
        Aaron T. Myers
      3. HADOOP-7896-HDFS-1623.patch
        17 kB
        Aaron T. Myers

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #92 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/92/)
          HADOOP-8116. RetriableCommand is using RetryPolicy incorrectly after HADOOP-7896. Contributed by Aaron T. Myers. (Revision 1294729)

          Result = UNSTABLE
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294729
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/RetriableCommand.java
          • /hadoop/common/branches/HDFS-1623/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java
          Aaron T. Myers made changes -
          Link: This issue relates to HDFS-2005
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #16 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/16/)
          HADOOP-7896. HA: if both NNs are in Standby mode, client needs to try failing back and forth several times with sleeps. Contributed by Aaron T. Myers

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1214076
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicy.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ThreadUtil.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/retry/TestFailoverProxy.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/retry/UnreliableImplementation.java
          Aaron T. Myers made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Hadoop Flags: Reviewed [ 10343 ]
          Fix Version/s: HA Branch (HDFS-1623) [ 12317569 ]
          Resolution: Fixed [ 1 ]
          Aaron T. Myers added a comment -

          Thanks a lot for the reviews, Eli and Todd. I've just committed this.

          Eli Collins added a comment -

          +1 lgtm

          Aaron T. Myers made changes -
          Attachment: HADOOP-7896-HDFS-1623.patch [ 12507296 ]
          Aaron T. Myers added a comment -

          Thanks again for the review, Eli. I agree that that synchronization is unnecessary. Here's an updated patch which removes that sync.

          I'm going to commit this momentarily unless there are further objections.

          Eli Collins added a comment -

          I don't think you need to synchronize the sleep in RetryInvocationHandler#invoke; otherwise looks great.

          Todd Lipcon added a comment -

          +1, looks good from me.

          Aaron T. Myers made changes -
          Attachment: HADOOP-7896-HDFS-1623.patch [ 12507283 ]
          Aaron T. Myers added a comment -

          Thanks a lot for the review, Eli. Here's an updated patch which addresses your comments.

          Eli Collins added a comment -
          • In calculateExponentialTime this should be *, otherwise you'll get +/- 0%-150%:
            RAND.nextFloat() + 0.5
          • In RetryInvocationHandler#invoke I'd pull the sleep loop out to a Util method, e.g. sleepAtLeastIgnoreInterrupts (since it's useful elsewhere and we may sleep 2x delayMillis)
          • testFailoverBetweenMultipleStandbys needs a javadoc

          Otherwise looks great.
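
One reading of the two code-related review points above can be sketched as follows. The method names `calculateExponentialTime` and `sleepAtLeastIgnoreInterrupts` come from the comment, but the bodies, signatures, and the `BackoffSketch` class are assumptions made for illustration and are not the committed patch. The sketch assumes the jitter is meant to be a multiplicative factor of `RAND.nextDouble() + 0.5`, i.e. between 0.5x and 1.5x of the exponential delay, and that the sleep helper should recompute the remaining time after an interrupt rather than restart the full delay.

```java
import java.util.Random;

/** Hypothetical sketch; names echo the review, but this is not the committed patch. */
class BackoffSketch {
    private static final Random RAND = new Random();

    /**
     * Exponential backoff with multiplicative jitter:
     * baseMillis * 2^retries, scaled by a random factor in [0.5, 1.5).
     */
    static long calculateExponentialTime(long baseMillis, int retries) {
        double jitter = RAND.nextDouble() + 0.5; // factor in [0.5, 1.5)
        return (long) (baseMillis * Math.pow(2, retries) * jitter);
    }

    /**
     * Sleeps for at least the given number of milliseconds. An interrupt does
     * not end the wait early; only the time still remaining is slept again,
     * so the total stays close to millis instead of growing toward 2x.
     */
    static void sleepAtLeastIgnoreInterrupts(long millis) {
        long startNanos = System.nanoTime();
        long remaining = millis;
        while (remaining > 0) {
            try {
                Thread.sleep(remaining);
            } catch (InterruptedException ie) {
                // Deliberately swallowed: the caller asked to ignore interrupts.
            }
            long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
            remaining = millis - elapsedMillis;
        }
    }
}
```

For example, `calculateExponentialTime(100, 3)` in this sketch returns a value from 400 up to (but not including) 1200 milliseconds, which a caller could then pass to `sleepAtLeastIgnoreInterrupts`.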

          Todd Lipcon made changes -
          Component/s: ha [ 12316608 ]
          Aaron T. Myers made changes -
          Attachment: HADOOP-7896-HDFS-1623.patch [ 12507124 ]
          Aaron T. Myers added a comment -

          Here's a patch which addresses the issue.

          Steve Loughran added a comment -

          sleeps with a bit of random jitter, I hope, possibly even exponential backoff

          Todd Lipcon created issue -

            People

            • Assignee: Aaron T. Myers
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 4
