[HDFS-7009] Active NN and standby NN have different live nodes - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.6.0
Fix Version/s: 2.7.0, 2.6.1, 3.0.0-alpha1
Component/s: datanode
Labels: 2.6.1-candidate
Target Version/s: 2.7.0
Hadoop Flags: Reviewed

Description

This is a follow-up to https://issues.apache.org/jira/browse/HDFS-6478. In most cases, because the DN sends heartbeats (HB) and block reports (BR) to the NN regularly, the failure of a specific RPC call isn't a big deal.

However, there are cases where the DN fails to register with the NN during the initial handshake due to exceptions not covered by the RPC client's connection retry. When this happens, the DN won't talk to that NN until the DN restarts.

BPServiceActor

  public void run() {
    LOG.info(this + " starting to offer service");

    try {
      // init stuff
      try {
        // setup storage
        connectToNNAndHandshake();
      } catch (IOException ioe) {
        // Initial handshake, storage recovery or registration failed
        // End BPOfferService thread
        LOG.fatal("Initialization failed for block pool " + this, ioe);
        return;
      }

      initialized = true; // bp is initialized;
      
      while (shouldRun()) {
        try {
          offerService();
        } catch (Exception ex) {
          LOG.error("Exception in BPOfferService for " + this, ex);
          sleepAndLogInterrupts(5000, "offering service");
        }
      }
...

Here is an example of the call stack.

java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxx"; destination host is: "yyy":8030;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
        at org.apache.hadoop.ipc.Client.call(Client.java:1239)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Response is null.
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)

This creates a discrepancy between the active NN and the standby NN in terms of live nodes.

Here is a possible scenario of missing blocks after failover.

1. DN A and B complete the handshake with the active NN, but not with the standby NN.
2. A block is replicated to DN A, B and C.
3. From the standby NN's point of view, A and B are dead nodes, so the block is under-replicated.
4. DN C goes down.
5. Before the active NN detects that DN C is down, it fails over.
6. The new active NN considers the block missing, even though there are two replicas on DN A and B.
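
The scenario above can be walked through with a small model. This is only an illustrative sketch, not NameNode code: the class `FailoverReplicaView` and the method `liveReplicas` are hypothetical names, and each NN's view is reduced to the set of DNs it currently believes are live. It also assumes that the standby NN has already noticed the loss of DN C by the time it becomes active.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FailoverReplicaView {

    /** Counts the replicas of one block that a NameNode believes are on live nodes. */
    static int liveReplicas(Set<String> liveNodes, List<String> replicaHolders) {
        int count = 0;
        for (String dn : replicaHolders) {
            if (liveNodes.contains(dn)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Step 2: the block has replicas on DN A, B and C.
        List<String> replicaHolders = Arrays.asList("A", "B", "C");

        // Step 1: A and B registered only with the active NN.
        Set<String> activeView = new HashSet<>(Arrays.asList("A", "B", "C"));
        Set<String> standbyView = new HashSet<>(Arrays.asList("C"));

        // Step 3: the standby NN sees a single live replica (under-replicated).
        System.out.println("standby live replicas: "
            + liveReplicas(standbyView, replicaHolders));

        // Step 4: DN C goes down (assumption: the standby NN has noticed this).
        standbyView.remove("C");

        // Steps 5-6: after failover the standby's view is authoritative,
        // so the block looks missing although A and B still hold replicas.
        System.out.println("new active live replicas: "
            + liveReplicas(standbyView, replicaHolders));
        System.out.println("old active live replicas: "
            + liveReplicas(activeView, replicaHolders));
    }
}
```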

Attachments

HDFS-7009.patch, 27/Sep/14 05:59, 10 kB, Ming Ma
HDFS-7009-2.patch, 04/Oct/14 04:05, 10 kB, Ming Ma
HDFS-7009-3.patch, 14/Feb/15 06:00, 10 kB, Ming Ma
HDFS-7009-4.patch, 21/Feb/15 03:51, 10 kB, Ming Ma

Issue Links

relates to:

HDFS-2882 DN continues to start up, even if block pool fails to initialize (Closed)
HDFS-7714 Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully with only one NameNode. (Closed)

Activity

Ming Ma added a comment - 27/Sep/14 02:40

Given there are existing retries inside BPServiceActor, the patch just adds an additional retry in BPServiceActor.

The policy is to retry up to a configurable maximum number of times when the IOException isn't a RemoteException. That covers the common case of an IOException caused by a network issue. If the NN throws DisallowedDatanodeException, it will be wrapped in a RemoteException; BPServiceActor won't retry in that scenario.

Note that this issue can also happen outside NN startup time. When the NN loses heartbeats from the DN and the DN reconnects to the NN later, re-registration can throw an IOException due to a network issue, and the subsequent incremental BR RPC will fail with UnregisteredNodeException; that will cause BPServiceActor to shut down.
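
The retry policy described in this comment could look roughly like the sketch below. It is not taken from the attached patch: `RegistrationRetrySketch`, `Registration`, `registerWithRetry` and the `RemoteFailure` class (a local stand-in for Hadoop's RemoteException, so the example has no Hadoop dependency) are all hypothetical names. Errors that stand for a rejection by the NN are rethrown immediately; any other IOException is retried up to `maxRetries` additional times.

```java
import java.io.IOException;

public class RegistrationRetrySketch {

    /** Stand-in for RemoteException: an error raised by the NN itself. */
    static class RemoteFailure extends IOException {
        RemoteFailure(String message) {
            super(message);
        }
    }

    /** Hypothetical registration call that may fail with an IOException. */
    interface Registration {
        void register() throws IOException;
    }

    /**
     * Attempts registration, retrying non-remote IOExceptions up to
     * maxRetries additional times. Returns the number of attempts made.
     */
    static int registerWithRetry(Registration call, int maxRetries, long sleepMillis)
            throws IOException, InterruptedException {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                call.register();
                return attempts;
            } catch (RemoteFailure rejectedByNameNode) {
                // The NN answered and refused (e.g. a disallowed DN): do not retry.
                throw rejectedByNameNode;
            } catch (IOException localOrNetworkError) {
                if (attempts > maxRetries) {
                    throw localOrNetworkError;
                }
                Thread.sleep(sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate two transient failures followed by a successful registration.
        int[] failuresLeft = {2};
        int attempts = registerWithRetry(() -> {
            if (failuresLeft[0]-- > 0) {
                throw new IOException("Response is null.");
            }
        }, 5, 10L);
        System.out.println("registered after " + attempts + " attempts");
    }
}
```

Catching the stand-in for RemoteException before the general IOException is what separates "the NN rejected us" from "the call never completed"; only the second case is worth retrying.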

Hadoop QA added a comment - 27/Sep/14 04:05

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671618/HDFS-7009.patch
against trunk revision 5f16c98.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 1 new or modified test files.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

-1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//console

This message is automatically generated.

Hadoop QA added a comment - 27/Sep/14 08:48

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671624/HDFS-7009.patch
against trunk revision 5f16c98.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 1 new or modified test files.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

-1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//console

This message is automatically generated.

Hadoop QA added a comment - 03/Oct/14 05:48

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671624/HDFS-7009.patch
against trunk revision 2d8e6e2.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8309//console

This message is automatically generated.

Ming Ma added a comment - 03/Oct/14 15:51

Rebased on trunk.

Hadoop QA added a comment - 03/Oct/14 16:34

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672779/HDFS-7009-2.patch
against trunk revision 2d8e6e2.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 1 new or modified test files.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

-1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. Failed to build the native portion of hadoop-common prior to running the unit tests in hadoop-hdfs-project/hadoop-hdfs

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//console

This message is automatically generated.

Hadoop QA added a comment - 04/Oct/14 02:14

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672870/HDFS-7009-2.patch
against trunk revision 7f6ed7f.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 1 new or modified test files.

-1 javac. The applied patch generated 1280 javac compiler warnings (more than the trunk's current 1266 warnings).

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

-1 findbugs. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//console

This message is automatically generated.

Hadoop QA added a comment - 04/Oct/14 07:05

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672908/HDFS-7009-2.patch
against trunk revision bbb3b1a.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 1 new or modified test files.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

-1 findbugs. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//console

This message is automatically generated.

Ming Ma added a comment - 06/Oct/14 14:44

The Findbugs warnings and the failed unit tests aren't related to this patch.

Arpit Agarwal added a comment - 13/Feb/15 03:23

Hi mingma, thanks for reporting this issue and posting the patch. Does this bug still exist after 2.4.1?

It looks like BPServiceActor#run has a retry loop added by HDFS-2882.

  public void run() {

    try {
      while (true) {
        // init stuff
        try {
          // setup storage
          connectToNNAndHandshake();
          break;
        } catch (IOException ioe) {
          // Initial handshake, storage recovery or registration failed
          runningState = RunningState.INIT_FAILED;
          if (shouldRetryInit()) {
            // Retry until all namenode's of BPOS failed initialization
            LOG.error("Initialization failed for " + this + " "
                + ioe.getLocalizedMessage());
            sleepAndLogInterrupts(5000, "initializing");

Hadoop QA added a comment - 13/Feb/15 03:30

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672908/HDFS-7009-2.patch
against trunk revision 2f1e5dc.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9569//console

This message is automatically generated.

Ming Ma added a comment - 13/Feb/15 04:20

Thanks, arpitagarwal. The patch seems to be useful even after HDFS-2882 as it handles exception outside after initialization. Actually it looks quite like the patch in HDFS-7714 from vinayrpet and cnauroth. HDFS-7714 catches only EOFException, but in the call stack above the exception comes from throw new IOException("Response is null."); in the RPC client.

Arpit Agarwal added a comment - 13/Feb/15 19:42

The patch seems to be useful even after HDFS-2882 as it handles exception outside after initialization.

Thanks for the response Ming, are you referring to reRegister?

Chris Nauroth added a comment - 13/Feb/15 19:59

Hi mingma. Thanks for giving me the notification, and I'm sorry I didn't spot this before I filed HDFS-7714. You're right that it's very similar.

I think it's helpful that your patch switches from whitelisting a set of acceptable errors (potentially unpredictable) to blacklisting known fatal errors (well-defined as DisallowedDatanodeException).

I don't think we need a configurable maximum retry count. Error handling in the DataNode/NameNode connection traditionally has been handled with infinite retries. This keeps the DataNode process up and running and robust against unplanned NameNode downtime. Let me know if you disagree on this point.

If you want to rebase the patch, I think it would be valuable to get it in. Thanks again!

Ming Ma added a comment - 14/Feb/15 06:00

Thanks, Arpit. Yes, I meant reRegister.

Thanks, Chris. I agree with both of your points. Here is the updated patch. The fix is to throw a specific exception from the RPC client; EOFException appears to be a good choice for this specific scenario.

Hadoop QA added a comment - 14/Feb/15 09:45

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12698889/HDFS-7009-3.patch
against trunk revision 6804d68.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 1 new or modified test files.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.datanode.TestDataNodeRollingUpgrade

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//console

This message is automatically generated.

Chris Nauroth added a comment - 21/Feb/15 00:16

Thanks for updating the patch, Ming. I wasn't thinking of a fix at the RPC client layer, but after seeing the patch, I think this is the right thing to do. The protobuf parseDelimitedFrom method is documented to return null if the input stream is already at EOF, so semantically, EOFException is the right error code. This change may also benefit other RPC clients, such as YARN's RMProxy, where there is a retry policy associated with EOFException.
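
The idea of mapping a null parse result to EOFException can be sketched without protobuf. This is an assumption-laden illustration, not the code in the patch: `NullResponseSketch`, `parseDelimited` and `readResponse` are hypothetical names, and the stand-in parser uses a single-byte length prefix. Its only point is to mimic the contract described above, where the parser returns null if the stream is already at EOF.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class NullResponseSketch {

    /**
     * Stand-in for a delimited parser: returns null if the stream is already
     * at EOF, otherwise reads a one-byte length prefix and then the body.
     */
    static byte[] parseDelimited(InputStream in) throws IOException {
        int length = in.read();
        if (length == -1) {
            return null; // nothing was sent before the stream ended
        }
        byte[] body = new byte[length];
        int offset = 0;
        while (offset < length) {
            int read = in.read(body, offset, length - offset);
            if (read == -1) {
                throw new EOFException("Stream ended inside a message body");
            }
            offset += read;
        }
        return body;
    }

    /** Reads one response, reporting a missing response as an EOFException. */
    static byte[] readResponse(InputStream in) throws IOException {
        byte[] response = parseDelimited(in);
        if (response == null) {
            // A specific exception type lets retry policies recognize this case.
            throw new EOFException("Response is null.");
        }
        return response;
    }

    public static void main(String[] args) throws IOException {
        byte[] ok = readResponse(new ByteArrayInputStream(new byte[] {2, 7, 9}));
        System.out.println("body length: " + ok.length);
        try {
            readResponse(new ByteArrayInputStream(new byte[0]));
        } catch (EOFException e) {
            System.out.println("empty stream: " + e.getMessage());
        }
    }
}
```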

Since this is a change lower down at the RPC layer, I'd like to wait until next week to commit, in case anyone else wants to review. I'm also notifying szetszwo, who originally worked on this code for HDFS-3504 (configurable retry policies for DFSClient). Nicholas, do you see any problem with making this change?

You'll need to update the patch one more time. The method signature of sendHeartbeat changed recently. You'll need to add one more parameter to that call in the test, and it can be set to Mockito.any(VolumeFailureSummary.class). There are also some typos: "mokito" instead of "mockito". Let's correct those.

The test failure in the last Jenkins run appears to be unrelated.

Thanks again for your work on this, Ming!

Ming Ma added a comment - 21/Feb/15 03:51

Thanks, Chris. Here is the updated patch.

Nicholas can confirm, but FailoverOnNetworkExceptionRetry, defined in RetryPolicies, handles any IOException that isn't a RemoteException, so this change shouldn't alter that behavior.

Hadoop QA added a comment - 21/Feb/15 07:37

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12700000/HDFS-7009-4.patch
against trunk revision 6f01330.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 1 new or modified test files.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 javadoc. There were no new javadoc warning messages.

+1 eclipse:eclipse. The patch built with eclipse:eclipse.

+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9633//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9633//console

This message is automatically generated.

Tsz-wo Sze added a comment - 21/Feb/15 15:33

It looks like throwing EOFException is a good choice, since other methods such as in.readInt() also throw EOFException. I do have a question: in receiveRpcResponse, it first reads totalLen, then the RPC response header, and then the RPC body, as shown below. Is there any reason that the input stream ends right after reading the totalLen, or is it just a coincidence?

        int totalLen = in.readInt();
        RpcResponseHeaderProto header = 
            RpcResponseHeaderProto.parseDelimitedFrom(in);
        checkResponse(header);
        ...
          value.readFields(in);                 // read value

Chris Nauroth added a comment - 21/Feb/15 21:54

szetszwo, thank you for taking a look.

> Is there any reason that the input stream ends right after reading the totalLen or just a coincidence?

Good question. Ultimately, this was just a coincidence of a DataNode trying to register during a poorly timed NameNode restart. Both Ming and I have observed slightly different versions of this problem. ~~HDFS-7714~~ fixed the problem I saw by handling EOFException during registration, but we still need Ming's patch here to cover the slightly different problem he saw.

There are 4 separate cases to consider:

1. DataNode connects to NameNode and sends registration request. NameNode shuts down and terminates socket connection before writing any RPC response bytes. At the DataNode, the RPC client observes this as an EOFException thrown from the DataInputStream#readInt call. With ~~HDFS-7714~~, this case is handled correctly.
2. DataNode connects to NameNode. NameNode sends response length and starts sending a response header, but it shuts down and terminates the socket connection before writing the complete response header. The contract of parseDelimitedFrom states that unexpected EOF part-way through parsing will propagate an EOFException to the caller. At the DataNode, the RPC client observes the EOFException and therefore ~~HDFS-7714~~ handles this case correctly too.
3. DataNode connects to NameNode. NameNode sends response length and complete response header, and then starts writing the response body, but shuts down and terminates the socket connection before writing the complete response body. At the DataNode, the RPC client observes EOFException while trying to read the response body bytes, and therefore ~~HDFS-7714~~ handles this case correctly too.
4. DataNode connects to NameNode. NameNode sends only response length, and then shuts down and terminates the socket connection before sending anything else. The contract of parseDelimitedFrom states that if the stream is already positioned at EOF, then the return value is null. At the DataNode, the current RPC client code handles this case by throwing IOException. This isn't sufficient information for the DataNode to know if it's safe to reattempt registration, even with ~~HDFS-7714~~, so this is still a registration failure.
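The four cases can be reproduced with plain java.io by cutting a framed response at different points. This is only a sketch under simplifying assumptions: it uses a one-byte header length in place of protobuf's varint, it does not call Hadoop or protobuf, and `TruncatedResponseDemo` and `readOutcome` are hypothetical names. The point it illustrates is that only a cut exactly after the 4-byte length yields "nothing to read" rather than an EOFException.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

public class TruncatedResponseDemo {

  /** Reads one framed response and reports where the stream ended. */
  public static String readOutcome(byte[] wire) {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
    try {
      int totalLen;
      try {
        totalLen = in.readInt();
      } catch (EOFException e) {
        return "case 1: EOFException from readInt";
      }
      // Simplified stand-in for the size prefix of a delimited header.
      int headerLen = in.read();
      if (headerLen == -1) {
        // Stream is already at EOF: the parser has nothing to return.
        return "case 4: header is null";
      }
      try {
        in.readFully(new byte[headerLen]);
      } catch (EOFException e) {
        return "case 2: EOFException in header";
      }
      // Body is whatever remains of totalLen after the size byte and header.
      int bodyLen = Math.max(0, totalLen - 1 - headerLen);
      try {
        in.readFully(new byte[bodyLen]);
      } catch (EOFException e) {
        return "case 3: EOFException in body";
      }
      return "complete response";
    } catch (IOException e) {
      return "unexpected: " + e;
    }
  }

  public static void main(String[] args) {
    // 4-byte length (6), 1-byte header size (2), 2 header bytes, 3 body bytes.
    byte[] full = {0, 0, 0, 6, 2, 10, 11, 20, 21, 22};
    for (int n : new int[] {0, 4, 6, 8, full.length}) {
      System.out.println(n + " bytes -> " + readOutcome(Arrays.copyOf(full, n)));
    }
  }
}
```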

Here is the documentation for parseDelimitedFrom:

https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/AbstractParser#parseDelimitedFrom(java.io.InputStream)

It's probably a documentation bug that they say the return value is false. Here is the actual protobuf code from AbstractParser#parsePartialDelimitedFrom where we see it checking the stream for EOF and returning null before attempting to parse:

  public MessageType parsePartialDelimitedFrom(
      InputStream input,
      ExtensionRegistryLite extensionRegistry)
      throws InvalidProtocolBufferException {
    int size;
    try {
      int firstByte = input.read();
      if (firstByte == -1) {
        return null;
      }
      size = CodedInputStream.readRawVarint32(firstByte, input);
    } catch (IOException e) {
      throw new InvalidProtocolBufferException(e.getMessage());
    }
    InputStream limitedInput = new LimitedInputStream(input, size);
    return parsePartialFrom(limitedInput, extensionRegistry);
  }

To summarize, ~~HDFS-7714~~ is sufficient to handle cases 1-3, but we still need Ming's patch here for correct handling of case 4. I also think it's correct behavior for all RPC clients, not just the specific case of DataNode registration.
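For case 4, the difference between the two behaviors can be sketched as follows. This is an illustration, not the patch itself: the class and method names are hypothetical, the header is modeled as a bare Object, and `safeToReattempt` is an assumed simplification of whatever the registration path actually checks. The "Response is null." message is taken from the stack trace in the description.

```java
import java.io.EOFException;
import java.io.IOException;

public class NullHeaderCheck {

  /** Behavior described above for the current client: null header becomes a plain IOException. */
  public static void checkResponseBefore(Object header) throws IOException {
    if (header == null) {
      throw new IOException("Response is null.");
    }
  }

  /** Behavior discussed in this issue: null header is reported as end-of-stream. */
  public static void checkResponseAfter(Object header) throws IOException {
    if (header == null) {
      throw new EOFException("Response is null.");
    }
  }

  /** Hypothetical caller-side rule: only an EOFException is taken as safe to reattempt. */
  public static boolean safeToReattempt(IOException e) {
    return e instanceof EOFException;
  }

  /** Simulates case 4 (null header) and reports whether the caller would reattempt. */
  public static boolean reattempts(boolean patched) {
    try {
      if (patched) {
        checkResponseAfter(null);
      } else {
        checkResponseBefore(null);
      }
      return false; // not reached: the header is null in this simulation
    } catch (IOException e) {
      return safeToReattempt(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("before: reattempt = " + reattempts(false));
    System.out.println("after:  reattempt = " + reattempts(true));
  }
}
```

Raising the more specific exception type lets callers that already handle EOFException cover case 4 without any change on their side.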

Tsz-wo Sze added a comment - 22/Feb/15 15:24

cnauroth, thanks for the detailed explanation.

+1 on the patch.

Chris Nauroth added a comment - 23/Feb/15 17:24

+1 from me too. I'll commit this later today.

Chris Nauroth added a comment - 23/Feb/15 23:17

I have committed this to trunk and branch-2. Ming, thank you for contributing the patch. Arpit and Nicholas, thank you for your help on the code review.

Hudson added a comment - 23/Feb/15 23:23

FAILURE: Integrated in Hadoop-trunk-Commit #7178 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7178/)
~~HDFS-7009~~. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java

Ming Ma added a comment - 23/Feb/15 23:26

Thanks, Chris, Arpit and Nicholas.

Hudson added a comment - 24/Feb/15 11:32

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #114 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/114/)
~~HDFS-7009~~. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java

Hudson added a comment - 24/Feb/15 11:54

SUCCESS: Integrated in Hadoop-Yarn-trunk #848 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/848/)
~~HDFS-7009~~. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

Hudson added a comment - 24/Feb/15 14:19

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #105 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/105/)
~~HDFS-7009~~. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

Hudson added a comment - 24/Feb/15 14:30

SUCCESS: Integrated in Hadoop-Hdfs-trunk #2046 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2046/)
~~HDFS-7009~~. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

Hudson added a comment - 24/Feb/15 15:06

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #114 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/114/)
~~HDFS-7009~~. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

Hudson added a comment - 24/Feb/15 15:21

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2064 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2064/)
~~HDFS-7009~~. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)

hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

Vinod Kumar Vavilapalli added a comment - 01/Sep/15 18:26

sjlee0 backported this to 2.6.1. I just pushed the commit to 2.6.1 after running compilation and TestDatanodeProtocolRetryPolicy, which was changed in the patch.

People

Assignee: Ming Ma

Reporter: Ming Ma

Votes: 0

Watchers: 7

Dates

Created: 06/Sep/14 00:07

Updated: 30/Aug/16 01:41

Resolved: 23/Feb/15 23:17
