Hadoop HDFS / HDFS-7009

Active NN and standby NN have different live nodes

Details

    • Reviewed

    Description

      To follow up on https://issues.apache.org/jira/browse/HDFS-6478: in most cases, given that the DN sends heartbeats (HB) and block reports (BR) to the NN regularly, a single failed RPC call isn't a big deal.

      However, there are cases where the DN fails to register with the NN during the initial handshake due to exceptions not covered by the RPC client's connection retry. When this happens, the DN won't talk to that NN again until the DN restarts.

      BPServiceActor
      
        public void run() {
          LOG.info(this + " starting to offer service");
      
          try {
            // init stuff
            try {
              // setup storage
              connectToNNAndHandshake();
            } catch (IOException ioe) {
              // Initial handshake, storage recovery or registration failed
              // End BPOfferService thread
              LOG.fatal("Initialization failed for block pool " + this, ioe);
              return;
            }
      
            initialized = true; // bp is initialized;
            
            while (shouldRun()) {
              try {
                offerService();
              } catch (Exception ex) {
                LOG.error("Exception in BPOfferService for " + this, ex);
                sleepAndLogInterrupts(5000, "offering service");
              }
            }
      ...
      

      Here is an example of the call stack.

      java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxx"; destination host is: "yyy":8030;
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
              at org.apache.hadoop.ipc.Client.call(Client.java:1239)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
              at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:606)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
              at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
              at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Response is null.
              at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
              at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
      

      This creates a discrepancy between the active NN and the standby NN in terms of live nodes.

      Here is a possible scenario of missing blocks after failover.

      1. DNs A and B set up handshakes with the active NN, but not with the standby NN.
      2. A block is replicated to DNs A, B, and C.
      3. From the standby NN's point of view, A and B are dead nodes, so the block is under-replicated.
      4. DN C goes down.
      5. Before the active NN detects that DN C is down, a failover occurs.
      6. The new active NN considers the block missing, even though there are two replicas on DNs A and B.

      Attachments

        1. HDFS-7009.patch
          10 kB
          Ming Ma
        2. HDFS-7009-2.patch
          10 kB
          Ming Ma
        3. HDFS-7009-3.patch
          10 kB
          Ming Ma
        4. HDFS-7009-4.patch
          10 kB
          Ming Ma

        Issue Links

          Activity

            mingma Ming Ma added a comment -

            Given there are existing retries inside BPServiceActor, the patch just adds an additional retry in BPServiceActor.

            The policy is to retry up to a configurable maximum number of times in the case of an IOException that isn't a RemoteException. That way, it covers the common case of an IOException caused by a network issue. If the NN throws DisallowedDatanodeException, it will be wrapped in a RemoteException, and BPServiceActor won't retry in that scenario.

            Note that this issue can also happen outside NN startup. When the NN loses heartbeats from the DN and the DN reconnects later, reregistration can throw an IOException due to a network issue; the subsequent incremental BR RPC will then fail with UnregisteredNodeException, which causes BPServiceActor to shut down.
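
            For illustration, here is a minimal sketch of that retry decision; the class and method names are hypothetical, not the actual patch:

              import java.io.IOException;
              import org.apache.hadoop.ipc.RemoteException;

              // Hypothetical sketch: retry plain IOExceptions (typically network problems)
              // up to a configured limit, but never retry RemoteExceptions such as a
              // wrapped DisallowedDatanodeException thrown by the NN.
              class RegistrationRetryDecision {
                static boolean shouldRetryRegistration(IOException ioe,
                                                       int attemptsSoFar,
                                                       int maxRetries) {
                  if (ioe instanceof RemoteException) {
                    return false; // the NN rejected us explicitly; retrying won't help
                  }
                  return attemptsSoFar < maxRetries;
                }
              }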

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12671618/HDFS-7009.patch
            against trunk revision 5f16c98.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 1 new or modified test files.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. There were no new javadoc warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

            org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//testReport/
            Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12671624/HDFS-7009.patch
            against trunk revision 5f16c98.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 1 new or modified test files.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. There were no new javadoc warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

            org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
            org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//testReport/
            Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12671624/HDFS-7009.patch
            against trunk revision 2d8e6e2.

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8309//console

            This message is automatically generated.

            mingma Ming Ma added a comment -

            Rebase with trunk.

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12672779/HDFS-7009-2.patch
            against trunk revision 2d8e6e2.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 1 new or modified test files.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. There were no new javadoc warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.
            -1 core tests. Failed to build the native portion of hadoop-common prior to running the unit tests in hadoop-hdfs-project/hadoop-hdfs

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//testReport/
            Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12672870/HDFS-7009-2.patch
            against trunk revision 7f6ed7f.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 1 new or modified test files.

            -1 javac. The applied patch generated 1280 javac compiler warnings (more than the trunk's current 1266 warnings).

            +1 javadoc. There were no new javadoc warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            -1 findbugs. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

            org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
            org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
            org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//testReport/
            Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
            Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//artifact/patchprocess/diffJavacWarnings.txt
            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12672908/HDFS-7009-2.patch
            against trunk revision bbb3b1a.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 1 new or modified test files.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. There were no new javadoc warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            -1 findbugs. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

            org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
            org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS

            +1 contrib tests. The patch passed contrib unit tests.

            Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//testReport/
            Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//console

            This message is automatically generated.

            mingma Ming Ma added a comment -

            Findbugs and failed unit tests aren't related.

            arp Arpit Agarwal added a comment -

            Hi mingma, thanks for reporting this issue and posting the patch. Does this bug still exist after 2.4.1?

            It looks like BPServiceActor#run has a retry loop added by HDFS-2882.

              public void run() {
            
                try {
                  while (true) {
                    // init stuff
                    try {
                      // setup storage
                      connectToNNAndHandshake();
                      break;
                    } catch (IOException ioe) {
                      // Initial handshake, storage recovery or registration failed
                      runningState = RunningState.INIT_FAILED;
                      if (shouldRetryInit()) {
                        // Retry until all namenode's of BPOS failed initialization
                        LOG.error("Initialization failed for " + this + " "
                            + ioe.getLocalizedMessage());
                        sleepAndLogInterrupts(5000, "initializing");
            
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12672908/HDFS-7009-2.patch
            against trunk revision 2f1e5dc.

            -1 patch. The patch command could not apply the patch.

            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9569//console

            This message is automatically generated.

            mingma Ming Ma added a comment -

            Thanks, arpitagarwal. The patch seems to be useful even after HDFS-2882, as it handles exceptions that occur after initialization. Actually, it looks quite like the patch in HDFS-7714 from vinayrpet and cnauroth. HDFS-7714 catches only EOFException, but in the call stack above the failure comes from throw new IOException("Response is null."); in the RPC Client.

            arp Arpit Agarwal added a comment -

            The patch seems to be useful even after HDFS-2882 as it handles exception outside after initialization.

            Thanks for the response, Ming. Are you referring to reRegister?

            cnauroth Chris Nauroth added a comment -

            Hi mingma. Thanks for giving me the notification, and I'm sorry I didn't spot this before I filed HDFS-7714. You're right that it's very similar.

            I think it's helpful that your patch switches from whitelisting a set of acceptable errors (potentially unpredictable) to blacklisting known fatal errors (well-defined as DisallowedDatanodeException).

            I don't think we need a configurable maximum retry count. Error handling in the DataNode/NameNode connection traditionally has been handled with infinite retries. This keeps the DataNode process up and running and robust against unplanned NameNode downtime. Let me know if you disagree on this point.

            If you want to rebase the patch, I think it would be valuable to get it in. Thanks again!
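
            For illustration, the blacklist check could look roughly like the sketch below; the class and method names are hypothetical and this is not the committed code:

              import java.io.IOException;
              import org.apache.hadoop.hdfs.protocol.DisallowedDatanodeException;
              import org.apache.hadoop.ipc.RemoteException;

              // Hypothetical sketch: retry indefinitely, but treat a blacklisted
              // server-side exception (DisallowedDatanodeException) as fatal.
              class RegistrationBlacklistSketch {
                static boolean isFatal(IOException ioe) {
                  if (ioe instanceof RemoteException) {
                    // Unwrap the server-side exception and check it against the blacklist.
                    IOException unwrapped = ((RemoteException) ioe)
                        .unwrapRemoteException(DisallowedDatanodeException.class);
                    return unwrapped instanceof DisallowedDatanodeException;
                  }
                  return false; // local/network errors are retried forever
                }
              }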

            mingma Ming Ma added a comment -

            Thanks, Arpit. Yes, I meant reRegister.

            Thanks, Chris. I agree with both of your points. Here is the updated patch. The fix is to throw a specific exception from the RPC client; EOFException appears to be a good choice for this specific scenario.
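
            A simplified sketch of the kind of RPC-client change discussed here (not the exact committed diff) is shown below:

              import java.io.DataInputStream;
              import java.io.EOFException;
              import java.io.IOException;
              import org.apache.hadoop.ipc.protobuf.RpcHeaderProtos.RpcResponseHeaderProto;

              // Simplified sketch: if parseDelimitedFrom() returns null because the stream
              // is already at EOF, surface it as EOFException so existing retry handling
              // can recognize it, instead of a generic IOException("Response is null.").
              class ResponseHeaderReadSketch {
                static RpcResponseHeaderProto readHeader(DataInputStream in) throws IOException {
                  RpcResponseHeaderProto header = RpcResponseHeaderProto.parseDelimitedFrom(in);
                  if (header == null) {
                    throw new EOFException("Response is null.");
                  }
                  return header;
                }
              }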

            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12698889/HDFS-7009-3.patch
            against trunk revision 6804d68.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 1 new or modified test files.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. There were no new javadoc warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

            org.apache.hadoop.hdfs.server.datanode.TestDataNodeRollingUpgrade

            Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//testReport/
            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//console

            This message is automatically generated.

            cnauroth Chris Nauroth added a comment -

            Thanks for updating the patch, Ming. I wasn't thinking of a fix at the RPC client layer, but after seeing the patch, I think this is the right thing to do. The protobuf parseDelimitedFrom method is documented to return null if the input stream is already at EOF, so semantically, EOFException is the right error code. This change may also benefit other RPC clients, such as YARN's RMProxy, where there is a retry policy associated with EOFException.

            Since this is a change lower down at the RPC layer, I'd like to wait until next week to commit, in case anyone else wants to review. I'm also notifying szetszwo, who originally worked on this code for HDFS-3504 (configurable retry policies for DFSClient). Nicholas, do you see any problem with making this change?

            You'll need to update the patch one more time. The method signature of sendHeartbeat changed recently. You'll need to add one more parameter to that call in the test, and it can be set to Mockito.any(VolumeFailureSummary.class). There are also some typos: "mokito" instead of "mockito". Let's correct those.

            The test failure in the last Jenkins run appears to be unrelated.

            Thanks again for your work on this, Ming!

            mingma Ming Ma added a comment -

            Thanks, Chris. Here is the updated patch.

            Nicholas can confirm: FailoverOnNetworkExceptionRetry, defined in RetryPolicies, handles IOException that isn't a RemoteException, so this change shouldn't affect that behavior.
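
            Illustratively, one could probe such a policy directly; this is a sketch under the assumption that EOFException is classified like other network-level IOExceptions, and it is not taken from the patch or its tests:

              import java.io.EOFException;
              import org.apache.hadoop.io.retry.RetryPolicies;
              import org.apache.hadoop.io.retry.RetryPolicy;

              // Illustrative only: build a failover-on-network-exception policy and ask it
              // about an EOFException; the expected decision (failover/retry rather than a
              // hard failure) is an assumption, not verified output.
              class RetryPolicyCheckSketch {
                public static void main(String[] args) throws Exception {
                  RetryPolicy policy = RetryPolicies.failoverOnNetworkException(
                      RetryPolicies.TRY_ONCE_THEN_FAIL, 3);
                  RetryPolicy.RetryAction action =
                      policy.shouldRetry(new EOFException("connection closed"), 0, 0, true);
                  System.out.println(action.action);
                }
              }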

            hadoopqa Hadoop QA added a comment -

            +1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12700000/HDFS-7009-4.patch
            against trunk revision 6f01330.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 1 new or modified test files.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 javadoc. There were no new javadoc warning messages.

            +1 eclipse:eclipse. The patch built with eclipse:eclipse.

            +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

            Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9633//testReport/
            Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9633//console

            This message is automatically generated.

            szetszwo Tsz-wo Sze added a comment -

            It looks like throwing EOFException is a good choice, since other methods such as in.readInt() also throw EOFException. I do have a question – in receiveRpcResponse, it first reads totalLen, then the RPC response header, and then the RPC body, as shown below. Is there any reason that the input stream ends right after reading totalLen, or is it just a coincidence?

                    int totalLen = in.readInt();
                    RpcResponseHeaderProto header = 
                        RpcResponseHeaderProto.parseDelimitedFrom(in);
                    checkResponse(header);
                    ...
                      value.readFields(in);                 // read value
            
            cnauroth Chris Nauroth added a comment -

            szetszwo, thank you for taking a look.

            Is there any reason that the input stream ends right after reading totalLen, or is it just a coincidence?

            Good question. Ultimately, this was just a coincidence of a DataNode trying to register during a poorly timed NameNode restart. Both Ming and I have observed slightly different versions of this problem. HDFS-7714 fixed the problem I saw by handling EOFException during registration, but we still need Ming's patch here to cover the slightly different problem he saw.

            There are 4 separate cases to consider:

            1. DataNode connects to NameNode and sends registration request. NameNode shuts down and terminates socket connection before writing any RPC response bytes. At the DataNode, the RPC client observes this as an EOFException thrown from the DataInputStream#readInt call. With HDFS-7714, this case is handled correctly.
            2. DataNode connects to NameNode. NameNode sends response length and starts sending a response header, but it shuts down and terminates the socket connection before writing the complete response header. The contract of parseDelimitedFrom states that unexpected EOF part-way through parsing will propagate an EOFException to the caller. At the DataNode, the RPC client observes the EOFException and therefore HDFS-7714 handles this case correctly too.
            3. DataNode connects to NameNode. NameNode sends response length and complete response header, and then starts writing the response body, but shuts down and terminates the socket connection before writing the complete response body. At the DataNode, the RPC client observes EOFException while trying to read the response body bytes, and therefore HDFS-7714 handles this case correctly too.
            4. DataNode connects to NameNode. NameNode sends only response length, and then shuts down and terminates the socket connection before sending anything else. The contract of parseDelimitedFrom states that if the stream is already positioned at EOF, then the return value is null. At the DataNode, the current RPC client code handles this case by throwing IOException. This isn't sufficient information for the DataNode to know if it's safe to reattempt registration, even with HDFS-7714, so this is still a registration failure.

            Here is the documentation for parseDelimitedFrom:

            https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/AbstractParser#parseDelimitedFrom(java.io.InputStream)

            It's probably a documentation bug that they say the return value is false. Here is the actual protobuf code from AbstractParser#parsePartialDelimitedFrom where we see it checking the stream for EOF and returning null before attempting to parse:

              public MessageType parsePartialDelimitedFrom(
                  InputStream input,
                  ExtensionRegistryLite extensionRegistry)
                  throws InvalidProtocolBufferException {
                int size;
                try {
                  int firstByte = input.read();
                  if (firstByte == -1) {
                    return null;
                  }
                  size = CodedInputStream.readRawVarint32(firstByte, input);
                } catch (IOException e) {
                  throw new InvalidProtocolBufferException(e.getMessage());
                }
                InputStream limitedInput = new LimitedInputStream(input, size);
                return parsePartialFrom(limitedInput, extensionRegistry);
              }
            

            To summarize, HDFS-7714 is sufficient to handle cases 1-3, but we still need Ming's patch here for correct handling of case 4. I also think it's correct behavior for all RPC clients, not just the specific case of DataNode registration.

            szetszwo Tsz-wo Sze added a comment -

            cnauroth, thanks for the detailed explanation.

            +1 on the patch.

            cnauroth Chris Nauroth added a comment -

            +1 from me too. I'll commit this later today.

            cnauroth Chris Nauroth added a comment -

            I have committed this to trunk and branch-2. Ming, thank you for contributing the patch. Arpit and Nicholas, thank you for your help on the code review.

            hudson Hudson added a comment -

            FAILURE: Integrated in Hadoop-trunk-Commit #7178 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7178/)
            HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

            • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
            • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
            • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
            mingma Ming Ma added a comment -

            Thanks, Chris, Arpit and Nicholas.

            hudson Hudson added a comment -

            FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #114 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/114/)
            HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

            • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
            • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
            • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
            hudson Hudson added a comment -

            SUCCESS: Integrated in Hadoop-Yarn-trunk #848 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/848/)
            HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

            • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
            • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
            • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
            hudson Hudson added a comment -

            FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #105 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/105/)
            HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

            • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
            • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
            • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
            hudson Hudson added a comment -

            SUCCESS: Integrated in Hadoop-Hdfs-trunk #2046 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2046/)
            HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

            • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
            • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
            • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
            hudson Hudson added a comment -

            FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #114 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/114/)
            HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

            • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
            • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
            • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
            hudson Hudson added a comment -

            FAILURE: Integrated in Hadoop-Mapreduce-trunk #2064 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2064/)
            HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

            • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
            • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
            • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

            vinodkv Vinod Kumar Vavilapalli added a comment -

            sjlee0 backported this to 2.6.1. I just pushed the commit to 2.6.1 after running compilation and TestDatanodeProtocolRetryPolicy, which changed in the patch.

             People

               Assignee: Ming Ma
               Reporter: Ming Ma
               Votes: 0
               Watchers: 7