[HDFS-6179] Synchronized BPOfferService - datanode locks for slow namenode reply. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: datanode, namenode
Labels: None

Description

Scenario:

600 active DNs
1 active NN
HA configuration

When we start the SbNN, the huge number of blocks combined with a relatively small initialDelay causes it to go through multiple stop-the-world garbage collections during startup (pauses measured in minutes; the NameNode heap size is 75 GB). We have observed that the SbNN's slowness affects the active NN: the active NN loses DNs (they are considered dead due to missing heartbeats). We assume that some DNs are hanging.

When a DN is considered dead by the active NameNode, we observe what looks like a deadlock in the DN process. Part of the stack trace:

"DataNode: [file:/disk1,file:/disk2]  heartbeating to standbynamenode.net/10.10.10.10:8020" daemon prio=10 tid=0x00007ff429417800 nid=0x7f2a in Object.wait() [0x00007ff42122c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client.call(Client.java:1333)
        - locked <0x00000007db94e4c8> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.Client.call(Client.java:1300)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at $Proxy9.registerDatanode(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at $Proxy9.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:740)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromStandby(BPOfferService.java:603)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:506)
        - locked <0x0000000780006e08> (a org.apache.hadoop.hdfs.server.datanode.BPOfferService)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:704)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:539)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
        at java.lang.Thread.run(Thread.java:662)

"DataNode: [file:/disk1,file:/disk2]  heartbeating to activenamenode.net/10.10.10.11:8020" daemon prio=10 tid=0x00007ff428a24000 nid=0x7f29 waiting for monitor entry [0x00007ff42132e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.updateActorStatesFromHeartbeat(BPOfferService.java:413)
        - waiting to lock <0x0000000780006e08> (a org.apache.hadoop.hdfs.server.datanode.BPOfferService)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:535)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
        at java.lang.Thread.run(Thread.java:662)

Notice that both threads contend for the same lock (0x0000000780006e08), because of the synchronization on BPOfferService. The command from the standby cannot be processed because the standby NameNode is unresponsive; nevertheless, the DN keeps waiting for the SbNN's reply while holding that lock, and it waits long enough to be considered dead by the active NameNode.
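The contention described above can be reproduced outside Hadoop with a minimal sketch. This is not the BPOfferService code: the class SharedMonitorSketch, its method names, and the use of Thread.sleep as a stand-in for the blocking registerDatanode RPC are all assumptions made for illustration. It compares a blocking call made while the monitor is held with the same call made after the monitor is released, and measures how long a second thread waits to enter a short synchronized method.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an object whose synchronized methods share one monitor. */
class SharedMonitorSketch {

    /** Simulates a blocking RPC to an unresponsive NameNode (assumption: a plain sleep). */
    private static void slowRpc(CountDownLatch rpcStarted, long rpcMillis)
            throws InterruptedException {
        rpcStarted.countDown(); // signal that the simulated RPC is now in flight
        Thread.sleep(rpcMillis);
    }

    /** Pattern seen in the trace: the blocking call runs while the monitor is held. */
    synchronized void reRegisterHoldingLock(CountDownLatch rpcStarted, long rpcMillis)
            throws InterruptedException {
        slowRpc(rpcStarted, rpcMillis);
    }

    /** Alternative: touch shared state under the monitor, then call out without it. */
    void reRegisterOutsideLock(CountDownLatch rpcStarted, long rpcMillis)
            throws InterruptedException {
        synchronized (this) {
            // read or update shared state here; keep this section short
        }
        slowRpc(rpcStarted, rpcMillis);
    }

    /** Stand-in for the short synchronized update made by the active-NN thread. */
    synchronized void updateStateFromHeartbeat() {
        // intentionally empty: only acquiring the monitor matters here
    }

    /** Returns how long the calling thread waited to run updateStateFromHeartbeat(). */
    static long measureHeartbeatDelayMillis(boolean holdLockDuringRpc, long rpcMillis)
            throws InterruptedException {
        SharedMonitorSketch service = new SharedMonitorSketch();
        CountDownLatch rpcStarted = new CountDownLatch(1);
        Thread standbyActor = new Thread(() -> {
            try {
                if (holdLockDuringRpc) {
                    service.reRegisterHoldingLock(rpcStarted, rpcMillis);
                } else {
                    service.reRegisterOutsideLock(rpcStarted, rpcMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        standbyActor.start();
        rpcStarted.await(); // wait until the simulated RPC has started
        long start = System.nanoTime();
        service.updateStateFromHeartbeat();
        long waitedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        standbyActor.join();
        return waitedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("lock held during RPC, waited ms: "
                + measureHeartbeatDelayMillis(true, 500));
        System.out.println("lock released before RPC, waited ms: "
                + measureHeartbeatDelayMillis(false, 500));
    }
}
```

In the first case the second thread should stay blocked for roughly the duration of the simulated RPC, which mirrors the BLOCKED thread in the dump; in the second case it should proceed almost immediately. Whether the real code can safely release the lock before the RPC depends on invariants in BPOfferService that this sketch does not model.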

Info: if we kill the SbNN, the DN instantly reconnects to the active NN.

Issue Links

duplicates

HDFS-5014 BPOfferService#processCommandFromActor() synchronization on namenode RPC call delays IBR to Active NN, if Stanby NN is unstable

Closed

People

Assignee: Unassigned

Reporter: Rafal Wojdyla

Votes: 0

Watchers: 5

Dates

Created: 01/Apr/14 15:05

Updated: 02/Apr/14 02:06

Resolved: 02/Apr/14 02:06