Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: HA branch (HDFS-1623)
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: ha
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We need to close out the NN operations categories.

      The following operations should be left as is, ie not failover, as it's reasonable to call these on a standby, and we just need to update the TODO with a comment:

      • setSafeMode (Might want to force the standby out of safemode)
      • restoreFailedStorage (Might want to tell the standby to restore the shared edits dir)
      • saveNamespace, metaSave (Could imagine calling these on a standby eg in a recovery scenario)
      • refreshNodes (Decommissioning needs to refresh the standby)

      The following operations should be checked for READ, as neither should need to be called on standby, will failover unless stale reads are enabled:

      • getTransactionID, getEditLogManifest (we don't checkoint the standby)

      The following operations should be checked for WRITE, as they should not be called on a standby, ie should always failover:

      • finalizeUpgrade, distributedUpgradeProgress (should not be able to upgrade the standby)
      • setBalancerBandwidth (balancer should failover)
      1. hdfs-2922.txt
        6 kB
        Eli Collins
      2. hdfs-2922.txt
        6 kB
        Eli Collins
      3. hdfs-2922.txt
        6 kB
        Eli Collins
      4. hdfs-2922.txt
        8 kB
        Eli Collins

        Activity

        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-HAbranch-build #86 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/86/)
        HDFS-2922. HA: close out operation categories. Contributed by Eli Collins (Revision 1292620)

        Result = UNSTABLE
        eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1292620
        Files :

        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/BackupNode.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyState.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #86 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/86/ ) HDFS-2922 . HA: close out operation categories. Contributed by Eli Collins (Revision 1292620) Result = UNSTABLE eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1292620 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/BackupNode.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyState.java
        Hide
        Eli Collins added a comment -

        I've committed this. Thanks for the reviews Todd and ATM.

        Show
        Eli Collins added a comment - I've committed this. Thanks for the reviews Todd and ATM.
        Hide
        Aaron T. Myers added a comment -

        +1, the latest patch looks good to me. Thanks a lot for making these changes, Eli.

        Show
        Aaron T. Myers added a comment - +1, the latest patch looks good to me. Thanks a lot for making these changes, Eli.
        Hide
        Eli Collins added a comment -

        Forgot to mention, I'm re-running the hdfs tests for sanity.

        Show
        Eli Collins added a comment - Forgot to mention, I'm re-running the hdfs tests for sanity.
        Hide
        Eli Collins added a comment -

        ATM - thanks for the excellent feedback.

        #1 Done, no long necessary post Todd's change
        #2 Great suggestion, added an UNCHECKED op category for state-agnostic operations
        #3 Reasonable, done.
        #4 Those fixed spelling mistakes, I removed that part of the diff

        Show
        Eli Collins added a comment - ATM - thanks for the excellent feedback. #1 Done, no long necessary post Todd's change #2 Great suggestion, added an UNCHECKED op category for state-agnostic operations #3 Reasonable, done. #4 Those fixed spelling mistakes, I removed that part of the diff
        Hide
        Aaron T. Myers added a comment -

        Patch largely looks good, Eli. A few comments:

        1. I think you should remove the change to the logging in the event of failovers. I was under the impression that Todd had addressed your concerns with the addition of the "worthLogging" boolean which causes a message to be logged only in the event of a client failover after a previous successful RPC call.
        2. Rather than just add a comment saying we're not checking the op category in the case of administrative commands, let's add another op category explicitly for administrative commands that could reasonably be run on either NN, and adjust StandbyState#checkOperation accordingly.
        3. I think the operation category for getTransactionID should be CHECKPOINT, since that's what it's used for.
        4. There are 2 lines changed in comments that have no discernible change. Errant whitespace changes?
        Show
        Aaron T. Myers added a comment - Patch largely looks good, Eli. A few comments: I think you should remove the change to the logging in the event of failovers. I was under the impression that Todd had addressed your concerns with the addition of the "worthLogging" boolean which causes a message to be logged only in the event of a client failover after a previous successful RPC call. Rather than just add a comment saying we're not checking the op category in the case of administrative commands, let's add another op category explicitly for administrative commands that could reasonably be run on either NN, and adjust StandbyState#checkOperation accordingly. I think the operation category for getTransactionID should be CHECKPOINT, since that's what it's used for. There are 2 lines changed in comments that have no discernible change. Errant whitespace changes?
        Hide
        Eli Collins added a comment -

        Thanks Todd. Updated patch, modified getDatanodeReport to no longer check operation category, which is fine, it's reasonly for DFSAdmin#printTopology and DFSAdmin#report to not failover. I've run the full hdfs test suite with this patch.

        Show
        Eli Collins added a comment - Thanks Todd. Updated patch, modified getDatanodeReport to no longer check operation category, which is fine, it's reasonly for DFSAdmin#printTopology and DFSAdmin#report to not failover. I've run the full hdfs test suite with this patch.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-HAbranch-build #75 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/75/)
        Revert HDFS-2922 via svn merge -c -1242572

        The patch broke a lot of unit tests in the nightly build. Will recommit after it is fixed.

        todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1242874
        Files :

        • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #75 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/75/ ) Revert HDFS-2922 via svn merge -c -1242572 The patch broke a lot of unit tests in the nightly build. Will recommit after it is fixed. todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1242874 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
        Hide
        Todd Lipcon added a comment -

        This seems to have broken a bunch of tests since getDatanodeReport now throws an error on the standby, so MiniDFSCluster won't start an HA cluster anymore. I'm going to revert for now to get tests passing again on branch

        Show
        Todd Lipcon added a comment - This seems to have broken a bunch of tests since getDatanodeReport now throws an error on the standby, so MiniDFSCluster won't start an HA cluster anymore. I'm going to revert for now to get tests passing again on branch
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-HAbranch-build #74 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/74/)
        HDFS-2922. HA: close out operation categories. Contributed by Eli Collins

        eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1242572
        Files :

        • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #74 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/74/ ) HDFS-2922 . HA: close out operation categories. Contributed by Eli Collins eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1242572 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
        Hide
        Eli Collins added a comment -

        Thanks for the reviews. I've committed this.

        Show
        Eli Collins added a comment - Thanks for the reviews. I've committed this.
        Hide
        Todd Lipcon added a comment -

        +1

        Show
        Todd Lipcon added a comment - +1
        Hide
        Eli Collins added a comment -

        Yup, I see your points. Updated patch.

        Show
        Eli Collins added a comment - Yup, I see your points. Updated patch.
        Hide
        Suresh Srinivas added a comment -

        I agree with Todd. We should treat setBalancerBandwidth() same as refreshNodes(). Why do unnecessarily failover for now?

        Show
        Suresh Srinivas added a comment - I agree with Todd. We should treat setBalancerBandwidth() same as refreshNodes(). Why do unnecessarily failover for now?
        Hide
        Eli Collins added a comment -

        Patch attached. Updates operations per the description (most are unchanged). Also:

        • Remove warn in RetryInvocationHandler so it's just a debug so we don't log on every operation to an active that we failed over to
        • Fix a couple spelling mistakes
        Show
        Eli Collins added a comment - Patch attached. Updates operations per the description (most are unchanged). Also: Remove warn in RetryInvocationHandler so it's just a debug so we don't log on every operation to an active that we failed over to Fix a couple spelling mistakes
        Hide
        Todd Lipcon added a comment -

        setBalancerBandwidth (a somewhat bizarre RPC) would actually be called on both nodes, ideally... but I agree for now it should probably do a failover, with an open task to to treat it similar to refreshNodes() – ie automatically call on all NNs.

        Show
        Todd Lipcon added a comment - setBalancerBandwidth (a somewhat bizarre RPC) would actually be called on both nodes, ideally... but I agree for now it should probably do a failover, with an open task to to treat it similar to refreshNodes() – ie automatically call on all NNs.

          People

          • Assignee:
            Eli Collins
            Reporter:
            Eli Collins
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development