Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version: 2.6.0
- Hadoop Flags: Reviewed
Description
To follow up on https://issues.apache.org/jira/browse/HDFS-6478: in most cases, given that a DN sends heartbeats (HB) and block reports (BR) to the NN regularly, the failure of a single RPC call isn't a big deal.
However, there are cases where the DN fails to register with the NN during the initial handshake due to exceptions not covered by the RPC client's connection retry. When this happens, the DN won't talk to that NN until the DN restarts.
BPServiceActor:

    public void run() {
      LOG.info(this + " starting to offer service");
      try {
        // init stuff
        try {
          // setup storage
          connectToNNAndHandshake();
        } catch (IOException ioe) {
          // Initial handshake, storage recovery or registration failed
          // End BPOfferService thread
          LOG.fatal("Initialization failed for block pool " + this, ioe);
          return;
        }
        initialized = true; // bp is initialized;
        while (shouldRun()) {
          try {
            offerService();
          } catch (Exception ex) {
            LOG.error("Exception in BPOfferService for " + this, ex);
            sleepAndLogInterrupts(5000, "offering service");
          }
        }
        ...
Here is an example of the call stack.
    java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxx"; destination host is: "yyy":8030;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
        at org.apache.hadoop.ipc.Client.call(Client.java:1239)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: Response is null.
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
This creates a discrepancy between the active NN and the standby NN in terms of live nodes.
Here is a possible scenario of missing blocks after failover.
1. DN A and B set up handshakes with the active NN, but not with the standby NN.
2. A block is replicated to DN A, B, and C.
3. From the standby NN's point of view, since A and B are dead nodes, the block is under-replicated.
4. DN C goes down.
5. Before the active NN detects that DN C is down, a failover occurs.
6. The new active NN considers the block missing, even though there are two replicas on DN A and B.
Attachments
- HDFS-7009.patch (10 kB) by Ming Ma
- HDFS-7009-2.patch (10 kB) by Ming Ma
- HDFS-7009-3.patch (10 kB) by Ming Ma
- HDFS-7009-4.patch (10 kB) by Ming Ma
Activity
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671618/HDFS-7009.patch
against trunk revision 5f16c98.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8235//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671624/HDFS-7009.patch
against trunk revision 5f16c98.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8237//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671624/HDFS-7009.patch
against trunk revision 2d8e6e2.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8309//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672779/HDFS-7009-2.patch
against trunk revision 2d8e6e2.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. Failed to build the native portion of hadoop-common prior to running the unit tests in hadoop-hdfs-project/hadoop-hdfs
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8312//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672870/HDFS-7009-2.patch
against trunk revision 7f6ed7f.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
-1 javac. The applied patch generated 1280 javac compiler warnings (more than the trunk's current 1266 warnings).
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672908/HDFS-7009-2.patch
against trunk revision bbb3b1a.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8323//console
This message is automatically generated.
Hi mingma, thanks for reporting this issue and posting the patch. Does this bug still exist after 2.4.1?
It looks like BPServiceActor#run has a retry loop added by HDFS-2882.
    public void run() {
      try {
        while (true) {
          // init stuff
          try {
            // setup storage
            connectToNNAndHandshake();
            break;
          } catch (IOException ioe) {
            // Initial handshake, storage recovery or registration failed
            runningState = RunningState.INIT_FAILED;
            if (shouldRetryInit()) {
              // Retry until all namenode's of BPOS failed initialization
              LOG.error("Initialization failed for " + this + " "
                  + ioe.getLocalizedMessage());
              sleepAndLogInterrupts(5000, "initializing");
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672908/HDFS-7009-2.patch
against trunk revision 2f1e5dc.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9569//console
This message is automatically generated.
Thanks, arpitagarwal. The patch seems to be useful even after HDFS-2882, as it handles exceptions that occur after initialization. Actually, it looks quite like the patch in HDFS-7714 from vinayrpet and cnauroth. HDFS-7714 catches only EOFException, but the failure in the call stack above comes from throw new IOException("Response is null.") in the RPC Client.
The patch seems to be useful even after HDFS-2882 as it handles exceptions after initialization.
Thanks for the response Ming, are you referring to reRegister?
Hi mingma. Thanks for giving me the notification, and I'm sorry I didn't spot this before I filed HDFS-7714. You're right that it's very similar.
I think it's helpful that your patch switches from whitelisting a set of acceptable errors (potentially unpredictable) to blacklisting known fatal errors (well-defined as DisallowedDatanodeException).
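As a rough illustration of this blacklist idea, here is a sketch (the helper name isFatalRegistrationError is invented for illustration; the actual patch differs):

    // Classify a registration failure: a DisallowedDatanodeException arrives
    // wrapped in an org.apache.hadoop.ipc.RemoteException and is fatal; other
    // IOExceptions are assumed transient and safe to retry.
    private static boolean isFatalRegistrationError(IOException ioe) {
      if (ioe instanceof RemoteException) {
        IOException unwrapped = ((RemoteException) ioe)
            .unwrapRemoteException(DisallowedDatanodeException.class);
        return unwrapped instanceof DisallowedDatanodeException;
      }
      return false; // plain network-level IOException: retry
    }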
I don't think we need a configurable maximum retry count. Error handling in the DataNode/NameNode connection traditionally has been handled with infinite retries. This keeps the DataNode process up and running and robust against unplanned NameNode downtime. Let me know if you disagree on this point.
If you want to rebase the patch, I think it would be valuable to get it in. Thanks again!
Thanks, Arpit. Yes, I meant reRegister.
Thanks, Chris. I agree with both of your points. Here is the updated patch. The fix is to return a specific exception from the RPC client; EOFException appears to be a good choice for this specific scenario.
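For illustration, a minimal sketch of the kind of change described here, assuming it lands where org.apache.hadoop.ipc.Client parses the response header (the exact method and surrounding code in the patch may differ):

    // Sketch only: surface "stream already at EOF" as java.io.EOFException so
    // retry policies can treat it as a recoverable network error instead of a
    // generic IOException.
    RpcResponseHeaderProto header =
        RpcResponseHeaderProto.parseDelimitedFrom(in);
    if (header == null) {
      // parseDelimitedFrom returns null if the stream hits EOF before any
      // header bytes arrive; previously this raised a generic IOException.
      throw new EOFException("Response is null.");
    }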
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12698889/HDFS-7009-3.patch
against trunk revision 6804d68.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.server.datanode.TestDataNodeRollingUpgrade
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//console
This message is automatically generated.
Thanks for updating the patch, Ming. I wasn't thinking of a fix at the RPC client layer, but after seeing the patch, I think this is the right thing to do. The protobuf parseDelimitedFrom method is documented to return null if the input stream is already at EOF, so semantically, EOFException is the right error code. This change may also benefit other RPC clients, such as YARN's RMProxy, where there is a retry policy associated with EOFException.
Since this is a change lower down at the RPC layer, I'd like to wait until next week to commit, in case anyone else wants to review. I'm also notifying szetszwo, who originally worked on this code for HDFS-3504 (configurable retry policies for DFSClient). Nicholas, do you see any problem with making this change?
You'll need to update the patch one more time. The method signature of sendHeartbeat changed recently. You'll need to add one more parameter to that call in the test, and it can be set to Mockito.any(VolumeFailureSummary.class). There are also some typos: "mokito" instead of "mockito". Let's correct those.
The test failure in the last Jenkins run appears to be unrelated.
Thanks again for your work on this, Ming!
Thanks, Chris. Here is the updated patch.
As Nicholas can confirm, FailoverOnNetworkExceptionRetry, defined in RetryPolicies, handles IOExceptions that aren't RemoteExceptions, so this change shouldn't alter that behavior.
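As a hedged sketch of how a client might construct such a policy (the parameter values are arbitrary examples, not defaults):

    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;

    // Builds the FailoverOnNetworkExceptionRetry policy: per the comment
    // above, network-level IOExceptions (now including the EOFException
    // thrown by the RPC client) trigger retry/failover, while server-side
    // RemoteExceptions fall through to the fallback policy.
    RetryPolicy policy = RetryPolicies.failoverOnNetworkException(
        RetryPolicies.TRY_ONCE_THEN_FAIL, // fallback for non-network failures
        3);                               // example failover/retry budget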
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12700000/HDFS-7009-4.patch
against trunk revision 6f01330.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9633//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9633//console
This message is automatically generated.
It looks like throwing EOFException is a good choice since other methods such as in.readInt() also throw EOFException. I do have a question: in receiveRpcResponse, it first reads totalLen, then the RPC response header, and then the RPC body, as shown below. Is there any reason that the input stream ends right after reading the totalLen, or is it just a coincidence?
    int totalLen = in.readInt();
    RpcResponseHeaderProto header =
        RpcResponseHeaderProto.parseDelimitedFrom(in);
    checkResponse(header);
    ...
    value.readFields(in); // read value
szetszwo, thank you for taking a look.
Is there any reason that the input stream ends right after reading the totalLen or just a coincidence?
Good question. Ultimately, this was just a coincidence of a DataNode trying to register during a poorly timed NameNode restart. Both Ming and I have observed slightly different versions of this problem. HDFS-7714 fixed the problem I saw by handling EOFException during registration, but we still need Ming's patch here to cover the slightly different problem he saw.
There are 4 separate cases to consider:
1. DataNode connects to NameNode and sends a registration request. NameNode shuts down and terminates the socket connection before writing any RPC response bytes. At the DataNode, the RPC client observes this as an EOFException thrown from the DataInputStream#readInt call. With HDFS-7714, this case is handled correctly.
2. DataNode connects to NameNode. NameNode sends the response length and starts sending a response header, but shuts down and terminates the socket connection before writing the complete response header. The contract of parseDelimitedFrom states that an unexpected EOF part-way through parsing will propagate an EOFException to the caller. At the DataNode, the RPC client observes the EOFException, and therefore HDFS-7714 handles this case correctly too.
3. DataNode connects to NameNode. NameNode sends the response length and the complete response header, then starts writing the response body, but shuts down and terminates the socket connection before writing the complete response body. At the DataNode, the RPC client observes EOFException while trying to read the response body bytes, and therefore HDFS-7714 handles this case correctly too.
4. DataNode connects to NameNode. NameNode sends only the response length, then shuts down and terminates the socket connection before sending anything else. The contract of parseDelimitedFrom states that if the stream is already positioned at EOF, the return value is null. At the DataNode, the current RPC client code handles this case by throwing a generic IOException. This isn't sufficient information for the DataNode to know whether it's safe to reattempt registration, even with HDFS-7714, so this is still a registration failure.
The documentation for parseDelimitedFrom says the return value is false when the stream starts at EOF; that's probably a documentation bug, since the method actually returns null. Here is the actual protobuf code from AbstractParser#parsePartialDelimitedFrom, where we see it checking the stream for EOF and returning null before attempting to parse:
    public MessageType parsePartialDelimitedFrom(
        InputStream input,
        ExtensionRegistryLite extensionRegistry)
        throws InvalidProtocolBufferException {
      int size;
      try {
        int firstByte = input.read();
        if (firstByte == -1) {
          return null;
        }
        size = CodedInputStream.readRawVarint32(firstByte, input);
      } catch (IOException e) {
        throw new InvalidProtocolBufferException(e.getMessage());
      }
      InputStream limitedInput = new LimitedInputStream(input, size);
      return parsePartialFrom(limitedInput, extensionRegistry);
    }
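To make the EOF behavior concrete, here is a small standalone check, assuming Hadoop's generated protobuf classes are on the classpath (the class name EofDemo is invented for illustration):

    import java.io.ByteArrayInputStream;
    import org.apache.hadoop.ipc.protobuf.RpcHeaderProtos.RpcResponseHeaderProto;

    // Demonstrates that parseDelimitedFrom returns null, rather than
    // throwing, when the input stream is already at EOF (case 4 above).
    public class EofDemo {
      public static void main(String[] args) throws Exception {
        ByteArrayInputStream empty = new ByteArrayInputStream(new byte[0]);
        RpcResponseHeaderProto header =
            RpcResponseHeaderProto.parseDelimitedFrom(empty);
        System.out.println(header); // prints "null"
      }
    }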
To summarize, HDFS-7714 is sufficient to handle cases 1-3, but we still need Ming's patch here for correct handling of case 4. I also think it's correct behavior for all RPC clients, not just the specific case of DataNode registration.
I have committed this to trunk and branch-2. Ming, thank you for contributing the patch. Arpit and Nicholas, thank you for your help on the code review.
FAILURE: Integrated in Hadoop-trunk-Commit #7178 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7178/)
HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)
- hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #114 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/114/)
HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
- hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
SUCCESS: Integrated in Hadoop-Yarn-trunk #848 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/848/)
HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
- hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #105 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/105/)
HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
- hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
SUCCESS: Integrated in Hadoop-Hdfs-trunk #2046 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2046/)
HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
- hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #114 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/114/)
HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
- hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2064 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2064/)
HDFS-7009. Active NN and standby NN have different live nodes. Contributed by Ming Ma. (cnauroth: rev 769507bd7a501929d9a2fd56c72c3f50673488a4)
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDatanodeProtocolRetryPolicy.java
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
- hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
sjlee0 backported this to 2.6.1. I just pushed the commit to 2.6.1 after running compilation and TestDatanodeProtocolRetryPolicy, which changed in the patch.
Given that there are existing retries inside BPServiceActor, the patch just adds an additional retry in BPServiceActor; see the sketch below.
The policy is to retry a configurable maximum number of times in the case of an IOException that isn't a RemoteException. That way, it covers the common case of an IOException caused by a network issue. If the NN throws a DisallowedDatanodeException, it will be wrapped in a RemoteException, and BPServiceActor won't retry in that scenario.
Note that this issue can happen outside NN startup time. When the NN loses heartbeats from a DN and the DN reconnects with the NN later, re-registration can throw an IOException due to a network issue, and the subsequent incremental BR RPC will fail with UnregisteredNodeException; that will cause BPServiceActor to shut down.
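A minimal sketch of the retry shape described above (maxRetries and register are illustrative names; the actual patch code differs):

    // Illustrative only: retry plain IOExceptions up to a configured maximum,
    // but treat server-side rejections (org.apache.hadoop.ipc.RemoteException,
    // e.g. wrapping DisallowedDatanodeException) as fatal.
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      try {
        register(); // hypothetical stand-in for the registration RPC
        break;
      } catch (RemoteException re) {
        throw re;   // NN explicitly rejected this DN; do not retry
      } catch (IOException ioe) {
        LOG.warn("Registration attempt " + attempt + " failed; retrying", ioe);
        sleepAndLogInterrupts(5000, "registering");
      }
    }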