Description
DataTree.serializeNode synchronizes on the DataNode it is about to serialize then writes it out via OutputArchive.writeRecord, potentially to a network connection. Under default linux TCP settings, a network connection where the other side completely disappears will hang (blocking on the java.net.SocketOutputStream.socketWrite0 call) for over 15 minutes. During this time, any attempt to create/delete/modify the DataNode will cause the leader to hang at the beginning of the request processor chain:
"ProcessThread(sid:5 cport:-1):" prio=10 tid=0x00000000026f1800 nid=0x379c waiting for monitor entry [0x00007fe6c2a8c000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:163) - waiting to lock <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode) - locked <0x00000000d2ef81d0> (a java.util.ArrayList) at org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:345) at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:534) at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)
Additionally, any attempt to send a snapshot to a follower or to disk will hang.
Because the ping packets are sent by another thread which is unaffected, followers never time out and become leader, even though the cluster will make no progress until either the leader is killed or the TCP connection times out. This isn't exactly a deadlock since it will resolve itself eventually, but as mentioned above this will take > 15 minutes with the default TCP retry settings in linux.
A simple solution to this is: in DataTree.serializeNode we can take a copy of the contents of the DataNode (as is done with its children) in the synchronized block, then call writeRecord with the copy of the DataNode outside of the original DataNode synchronized block.