Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: ha, namenode
    • Labels:
      None

      Description

      When the trunk moves to protobuf based RPC, HAServiceProtocol should have equivalent protobuf implementation.

      1. HDFS-2586.txt
        24 kB
        Aaron T. Myers
      2. HDFS-2586.txt
        24 kB
        Suresh Srinivas
      3. HDFS-2586.txt
        24 kB
        Suresh Srinivas
      4. HDFS-2586.txt
        26 kB
        Suresh Srinivas

        Issue Links

          Activity

          Hide
          atm Aaron T. Myers added a comment -

          Suresh, any updates on this?

          Show
          atm Aaron T. Myers added a comment - Suresh, any updates on this?
          Hide
          sureshms Suresh Srinivas added a comment -

          Attached patch implements protobuf implementation for HAServiceProtocol

          Show
          sureshms Suresh Srinivas added a comment - Attached patch implements protobuf implementation for HAServiceProtocol
          Hide
          jnp Jitendra Nath Pandey added a comment -

          Please add KerberosInfo annotation to HAServiceProtocolPB.
          Changes to HdfsServerConstants.java and BackupNode.java don't seem to be related.

          +1 otherwise.

          Show
          jnp Jitendra Nath Pandey added a comment - Please add KerberosInfo annotation to HAServiceProtocolPB. Changes to HdfsServerConstants.java and BackupNode.java don't seem to be related. +1 otherwise.
          Hide
          atm Aaron T. Myers added a comment -

          Can you please test this manually on a real cluster before committing it? Just running the various `hdfs haadmin' commands with and without security should be sufficient. Thanks a lot.

          Show
          atm Aaron T. Myers added a comment - Can you please test this manually on a real cluster before committing it? Just running the various `hdfs haadmin' commands with and without security should be sufficient. Thanks a lot.
          Hide
          tlipcon Todd Lipcon added a comment -

          Did you mean for the changes to the NodeType enum to go into this patch? Not seeing the relation.

          Show
          tlipcon Todd Lipcon added a comment - Did you mean for the changes to the NodeType enum to go into this patch? Not seeing the relation.
          Hide
          sureshms Suresh Srinivas added a comment -

          Thanks Jitendra for catching the unnecessary changes. Updated patch removes thems.

          Aarong I manually ran the tests that use HAServiceProtocol, TestFailOverController and TestHAAdmin.

          Will test it manually on a real cluster, when I get time.

          Show
          sureshms Suresh Srinivas added a comment - Thanks Jitendra for catching the unnecessary changes. Updated patch removes thems. Aarong I manually ran the tests that use HAServiceProtocol, TestFailOverController and TestHAAdmin. Will test it manually on a real cluster, when I get time.
          Hide
          atm Aaron T. Myers added a comment -

          Aarong I manually ran the tests that use HAServiceProtocol, TestFailOverController and TestHAAdmin.

          Yea, but I've seen protocol changes like this break things when run on a real cluster that don't break when run from the tests. I'm particularly interested in seeing if the change affects things with/without security enabled.

          Will test it manually on a real cluster, when I get time.

          Thanks a lot!

          Show
          atm Aaron T. Myers added a comment - Aarong I manually ran the tests that use HAServiceProtocol, TestFailOverController and TestHAAdmin. Yea, but I've seen protocol changes like this break things when run on a real cluster that don't break when run from the tests. I'm particularly interested in seeing if the change affects things with/without security enabled. Will test it manually on a real cluster, when I get time. Thanks a lot!
          Hide
          sureshms Suresh Srinivas added a comment -

          Aaron, these are identical changes that have been tested on the trunk when we did protobuf changes. We did test secure and non-secure setup with those changes. I personally think it is not necessary for setting up secure and non-secure clusters to test this change. I think non-secure changes are tested sufficiently by existing unit tests.

          Show
          sureshms Suresh Srinivas added a comment - Aaron, these are identical changes that have been tested on the trunk when we did protobuf changes. We did test secure and non-secure setup with those changes. I personally think it is not necessary for setting up secure and non-secure clusters to test this change. I think non-secure changes are tested sufficiently by existing unit tests.
          Hide
          atm Aaron T. Myers added a comment -

          If Jitendra hadn't noticed the lack of KerberosInfo annotations, secure communication wouldn't have worked. Even though you've added them, their values being correct aren't covered at all by the tests.

          Another, similar change recently inadvertently broke several protocols because of some unintended consequences due to static initialization of RPC engines. This didn't show up in the tests because all daemons run in the same JVM, and thus static initialization for one will cover all.

          If we had a multi-process minicluster, or any support for automated testing with security enabled, I'd agree with you that running the automated tests would be sufficient. In lieu of that, please test this change manually.

          Show
          atm Aaron T. Myers added a comment - If Jitendra hadn't noticed the lack of KerberosInfo annotations, secure communication wouldn't have worked. Even though you've added them, their values being correct aren't covered at all by the tests. Another, similar change recently inadvertently broke several protocols because of some unintended consequences due to static initialization of RPC engines. This didn't show up in the tests because all daemons run in the same JVM, and thus static initialization for one will cover all. If we had a multi-process minicluster, or any support for automated testing with security enabled, I'd agree with you that running the automated tests would be sufficient. In lieu of that, please test this change manually.
          Hide
          sureshms Suresh Srinivas added a comment -

          Aaron, if we cannot commit based on unit tests then the process is broken. I do not want to waste time bringing up cluster and manually testing it, if possible.

          By the way, applying the same standard, the client failover code should not get committed at all. Because we know it does not work with delegation token

          Show
          sureshms Suresh Srinivas added a comment - Aaron, if we cannot commit based on unit tests then the process is broken. I do not want to waste time bringing up cluster and manually testing it, if possible. By the way, applying the same standard, the client failover code should not get committed at all. Because we know it does not work with delegation token
          Hide
          atm Aaron T. Myers added a comment -

          Aaron, if we cannot commit based on unit tests then the process is broken. I do not want to waste time bringing up cluster and manually testing it, if possible.

          Most of the time I would agree with you, but these sorts of changes aren't well covered by existing tests.

          Since I'm tired of arguing about manual testing, I just tested this patch manually on a cluster myself, and it appears to have broken haadmin. In particular, `hdfs haadmin -failover ...' always prints "Failover failed: Can't failover to an active service" and never causes an NN failover, even if both NNs are in the Standby state. If I revert this patch, `hdfs haadmin -failover ...' works again. `hdfs haadmin -getServiceState' appears to always return 'Initializing' with this patch applied, but works as intended without this patch. `hdfs haadmin -transitionToActive ...' and `hdfs haadmin -transitionToStandby ...' do appear to work with or without this patch.

          And for the record, it took me about 20 minutes to test this on a cluster.

          Show
          atm Aaron T. Myers added a comment - Aaron, if we cannot commit based on unit tests then the process is broken. I do not want to waste time bringing up cluster and manually testing it, if possible. Most of the time I would agree with you, but these sorts of changes aren't well covered by existing tests. Since I'm tired of arguing about manual testing, I just tested this patch manually on a cluster myself, and it appears to have broken haadmin. In particular, `hdfs haadmin -failover ...' always prints "Failover failed: Can't failover to an active service" and never causes an NN failover, even if both NNs are in the Standby state. If I revert this patch, `hdfs haadmin -failover ...' works again. `hdfs haadmin -getServiceState' appears to always return 'Initializing' with this patch applied, but works as intended without this patch. `hdfs haadmin -transitionToActive ...' and `hdfs haadmin -transitionToStandby ...' do appear to work with or without this patch. And for the record, it took me about 20 minutes to test this on a cluster.
          Hide
          sureshms Suresh Srinivas added a comment -

          Was that with security enabled or not?

          Show
          sureshms Suresh Srinivas added a comment - Was that with security enabled or not?
          Hide
          atm Aaron T. Myers added a comment -

          Was that with security enabled or not?

          That was with security disabled. I was going to try that next, but bumped into this issue first.

          Show
          atm Aaron T. Myers added a comment - Was that with security enabled or not? That was with security disabled. I was going to try that next, but bumped into this issue first.
          Hide
          sureshms Suresh Srinivas added a comment -

          Created a related jira HDFS-2937.

          Show
          sureshms Suresh Srinivas added a comment - Created a related jira HDFS-2937 .
          Hide
          tlipcon Todd Lipcon added a comment -

          The bug is here – missing "break"s:

          +    HAServiceStateProto ret;
          +    switch (s) {
          +    case ACTIVE:
          +      ret = HAServiceStateProto.ACTIVE;
          +    case STANDBY:
          +      ret = HAServiceStateProto.STANDBY;
          +    default:
          +    case INITIALIZING:
          +      ret = HAServiceStateProto.INITIALIZING;
          +    }
          
          Show
          tlipcon Todd Lipcon added a comment - The bug is here – missing "break"s: + HAServiceStateProto ret; + switch (s) { + case ACTIVE: + ret = HAServiceStateProto.ACTIVE; + case STANDBY: + ret = HAServiceStateProto.STANDBY; + default : + case INITIALIZING: + ret = HAServiceStateProto.INITIALIZING; + }
          Hide
          atm Aaron T. Myers added a comment -

          The bug is here – missing "break"s:

          Yep, good catch. There's actually two places with switch statements missing breaks in their cases in the patch.

          Show
          atm Aaron T. Myers added a comment - The bug is here – missing "break"s: Yep, good catch. There's actually two places with switch statements missing breaks in their cases in the patch.
          Hide
          atm Aaron T. Myers added a comment -

          There's actually two places with switch statements missing breaks in their cases in the patch.

          Never mind. I can't read. Only one place matters, since in the other place we actually call "return" instead of just setting a variable to be returned.

          Show
          atm Aaron T. Myers added a comment - There's actually two places with switch statements missing breaks in their cases in the patch. Never mind. I can't read. Only one place matters, since in the other place we actually call "return" instead of just setting a variable to be returned.
          Hide
          sureshms Suresh Srinivas added a comment -

          Good catch Todd. I tested the old patch with test from HDFS-2937 and it failed as expected. The new patch passes that test.

          Show
          sureshms Suresh Srinivas added a comment - Good catch Todd. I tested the old patch with test from HDFS-2937 and it failed as expected. The new patch passes that test.
          Hide
          atm Aaron T. Myers added a comment -

          I tested out the latest patch with and without security enabled, and it worked like a charm.

          One tiny nit: I'd swap the order of the "INITIALIZING" and "default" cases in the switch statement in HAServiceProtocolServerSideTranslatorPB#getServiceState, so that "default" comes last.

          +1 otherwise.

          Thanks a lot for taking care of this, Suresh.

          Show
          atm Aaron T. Myers added a comment - I tested out the latest patch with and without security enabled, and it worked like a charm. One tiny nit: I'd swap the order of the "INITIALIZING" and "default" cases in the switch statement in HAServiceProtocolServerSideTranslatorPB#getServiceState , so that "default" comes last. +1 otherwise. Thanks a lot for taking care of this, Suresh.
          Hide
          atm Aaron T. Myers added a comment -

          Here's a (very slightly) updated patch which addresses the nit. I'm going to commit this momentarily.

          Show
          atm Aaron T. Myers added a comment - Here's a (very slightly) updated patch which addresses the nit. I'm going to commit this momentarily.
          Hide
          atm Aaron T. Myers added a comment -

          I've just committed this to the HA branch. Thanks a lot, Suresh.

          Show
          atm Aaron T. Myers added a comment - I've just committed this to the HA branch. Thanks a lot, Suresh.
          Hide
          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #80 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/80/)
          HDFS-2586. Add protobuf service and implementation for HAServiceProtocol. Contributed by Suresh Srinivas. (Revision 1245338)

          Result = FAILURE
          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1245338
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/protocolPB
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/protocolPB/HAServiceProtocolClientSideTranslatorPB.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/protocolPB/HAServiceProtocolPB.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/protocolPB/HAServiceProtocolServerSideTranslatorPB.java
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/main/proto/HAServiceProtocol.proto
          • /hadoop/common/branches/HDFS-1623/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestFailoverController.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
          Show
          hudson Hudson added a comment - Integrated in Hadoop-Hdfs-HAbranch-build #80 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/80/ ) HDFS-2586 . Add protobuf service and implementation for HAServiceProtocol. Contributed by Suresh Srinivas. (Revision 1245338) Result = FAILURE atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1245338 Files : /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/HAAdmin.java /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/protocolPB /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/protocolPB/HAServiceProtocolClientSideTranslatorPB.java /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/protocolPB/HAServiceProtocolPB.java /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/protocolPB/HAServiceProtocolServerSideTranslatorPB.java /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/main/proto/HAServiceProtocol.proto /hadoop/common/branches/ HDFS-1623 /hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ha/TestFailoverController.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/CHANGES. HDFS-1623 .txt /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java /hadoop/common/branches/ HDFS-1623 /hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java

            People

            • Assignee:
              sureshms Suresh Srinivas
              Reporter:
              sureshms Suresh Srinivas
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development