[HDFS-7009] Active NN and standby NN have different live nodes - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.6.0
Fix Version/s: 2.7.0, 2.6.1, 3.0.0-alpha1
Component/s: datanode
Labels: 2.6.1-candidate
Target Version/s: 2.7.0
Hadoop Flags: Reviewed

Description

This follows up on https://issues.apache.org/jira/browse/HDFS-6478. In most cases, because the DN sends heartbeats (HB) and block reports (BR) to the NN regularly, the failure of a single RPC call isn't a big deal.

However, there are cases where the DN fails to register with the NN during the initial handshake because of exceptions that the RPC client's connection retry does not cover. When this happens, the DN won't talk to that NN again until the DN restarts.

BPServiceActor

  public void run() {
    LOG.info(this + " starting to offer service");

    try {
      // init stuff
      try {
        // setup storage
        connectToNNAndHandshake();
      } catch (IOException ioe) {
        // Initial handshake, storage recovery or registration failed
        // End BPOfferService thread
        LOG.fatal("Initialization failed for block pool " + this, ioe);
        return;
      }

      initialized = true; // bp is initialized;
      
      while (shouldRun()) {
        try {
          offerService();
        } catch (Exception ex) {
          LOG.error("Exception in BPOfferService for " + this, ex);
          sleepAndLogInterrupts(5000, "offering service");
        }
      }
...
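
To make the failure mode concrete, here is a minimal, self-contained sketch of the control flow above. It is not Hadoop code and not the attached patch: `Handshake`, `initOnce`, `initWithRetry`, and `failingFirst` are hypothetical names introduced only for illustration. It contrasts giving up after the first `IOException` (as in the snippet) with one possible alternative, a bounded retry loop.

```java
import java.io.IOException;

/** Illustrative model only; none of these names exist in Hadoop. */
class HandshakeRetrySketch {

  /** Hypothetical stand-in for the handshake and registration step. */
  interface Handshake {
    void connect() throws IOException;
  }

  /** Mirrors the snippet above: a single IOException ends initialization. */
  static boolean initOnce(Handshake h) {
    try {
      h.connect();
      return true;
    } catch (IOException ioe) {
      return false; // the actor thread would return here and never retry
    }
  }

  /** One possible alternative: retry with a fixed delay, up to maxAttempts. */
  static boolean initWithRetry(Handshake h, int maxAttempts, long sleepMs) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        h.connect();
        return true;
      } catch (IOException ioe) {
        if (attempt == maxAttempts) {
          break; // out of attempts
        }
        try {
          Thread.sleep(sleepMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // preserve the interrupt flag
          return false;
        }
      }
    }
    return false;
  }

  /** Builds a handshake that fails the first `failures` calls, then succeeds. */
  static Handshake failingFirst(int failures) {
    int[] remaining = {failures};
    return () -> {
      if (remaining[0] > 0) {
        remaining[0]--;
        throw new IOException("Response is null.");
      }
    };
  }

  public static void main(String[] args) {
    System.out.println("single attempt: " + initOnce(failingFirst(1)));
    System.out.println("with retry: " + initWithRetry(failingFirst(1), 3, 10L));
  }
}
```

With one transient failure injected, the single-attempt path reports failure while the retry path succeeds on its second attempt. The attempt limit and delay are arbitrary values chosen for the sketch.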

Here is an example of the call stack.

java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxx"; destination host is: "yyy":8030;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
        at org.apache.hadoop.ipc.Client.call(Client.java:1239)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Response is null.
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)

This creates a discrepancy between the active NN and the standby NN in terms of live nodes.

Here is a possible scenario of missing blocks after failover.

1. DN A and B complete the handshake with the active NN, but not with the standby NN.
2. A block is replicated to DN A, B and C.
3. From the standby NN's point of view, A and B are dead nodes, so the block is under-replicated.
4. DN C goes down.
5. Before the active NN detects that DN C is down, it fails over.
6. The new active NN considers the block missing, even though there are two replicas on DN A and B.
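
The steps above can be traced with a toy model. This is only a sketch, assuming that each NN counts a replica as live only if it regards the hosting DN as live; `FailoverScenarioSketch` and `liveReplicas` are hypothetical names, not Hadoop APIs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the scenario above; not Hadoop code, all names are hypothetical. */
class FailoverScenarioSketch {

  /** Counts the replica holders that a given NN currently regards as live. */
  static int liveReplicas(Set<String> liveNodes, List<String> replicaHolders) {
    int count = 0;
    for (String dn : replicaHolders) {
      if (liveNodes.contains(dn)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Step 2: the block has replicas on A, B and C.
    List<String> holders = Arrays.asList("A", "B", "C");

    // Step 1: A and B registered only with the active NN.
    Set<String> activeView = new HashSet<>(holders);
    Set<String> standbyView = new HashSet<>(Arrays.asList("C"));

    // Step 3: the standby already sees the block as under-replicated.
    System.out.println("active live replicas: " + liveReplicas(activeView, holders));
    System.out.println("standby live replicas: " + liveReplicas(standbyView, holders));

    // Steps 4-6: C goes down and the standby becomes the new active NN.
    standbyView.remove("C");
    System.out.println("new active live replicas: " + liveReplicas(standbyView, holders));
  }
}
```

In this model the active NN counts three live replicas, the standby counts one, and after DN C is lost the new active NN counts zero, although DN A and B still hold the block.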

Attachments

- HDFS-7009.patch (27/Sep/14 05:59, 10 kB, Ming Ma)
- HDFS-7009-2.patch (04/Oct/14 04:05, 10 kB, Ming Ma)
- HDFS-7009-3.patch (14/Feb/15 06:00, 10 kB, Ming Ma)
- HDFS-7009-4.patch (21/Feb/15 03:51, 10 kB, Ming Ma)

Issue Links

relates to

- HDFS-2882 DN continues to start up, even if block pool fails to initialize (Closed)
- HDFS-7714 Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully with only one NameNode. (Closed)

People

Assignee: Ming Ma
Reporter: Ming Ma
Votes: 0
Watchers: 7

Dates

Created: 06/Sep/14 00:07
Updated: 30/Aug/16 01:41
Resolved: 23/Feb/15 23:17