[ZOOKEEPER-2201] Network issues can cause cluster to hang due to near-deadlock - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.4.6, 3.5.0
Fix Version/s: 3.4.7, 3.5.2, 3.6.0
Component/s: None
Labels: None

Description

DataTree.serializeNode synchronizes on the DataNode it is about to serialize, then writes it out via OutputArchive.writeRecord while still holding that lock, potentially to a network connection. Under default Linux TCP settings, a write to a connection whose remote side has completely disappeared will hang (blocked in the java.net.SocketOutputStream.socketWrite0 call) for over 15 minutes. During this time, any attempt to create, delete, or modify that DataNode will cause the leader to hang at the beginning of the request processor chain:

"ProcessThread(sid:5 cport:-1):" prio=10 tid=0x00000000026f1800 nid=0x379c waiting for monitor entry [0x00007fe6c2a8c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:163)
        - waiting to lock <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
        - locked <0x00000000d2ef81d0> (a java.util.ArrayList)
        at org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:345)
        at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:534)
        at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:131)

Additionally, any attempt to send a snapshot to a follower or to disk will hang.

Because the ping packets are sent by another thread, which is unaffected, followers never time out and take over as leader, even though the cluster will make no progress until either the leader is killed or the TCP connection times out. This isn't exactly a deadlock, since it will resolve itself eventually, but as mentioned above this will take more than 15 minutes with the default TCP retry settings in Linux.
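As a rough check on the "more than 15 minutes" figure, the sketch below adds up the retransmission waits under exponential backoff. It assumes the Linux default of tcp_retries2 = 15, an initial retransmission timeout of 200 ms, and a cap of 120 s per wait; the real timeout depends on the measured round-trip time of the connection, so treat this as an estimate rather than the exact kernel behavior. The class and method names are made up for this illustration.

```java
public class TcpRetryTimeout {
    /**
     * Sums the waits for the original send plus each retransmission,
     * doubling the wait every time until it reaches the cap.
     * All three parameters are assumptions chosen by the caller.
     */
    static long totalTimeoutMillis(int retries, long initialRtoMillis, long maxRtoMillis) {
        long total = 0;
        long rto = initialRtoMillis;
        // One wait after the original send, then one after each retransmission.
        for (int i = 0; i <= retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, maxRtoMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        long millis = totalTimeoutMillis(15, 200, 120_000);
        System.out.println(millis / 1000.0 + " s"); // prints "924.6 s"
    }
}
```

Under these assumptions the total comes to 924.6 seconds, a little over 15 minutes, which is consistent with the hang duration described above.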

A simple solution: in DataTree.serializeNode, take a copy of the contents of the DataNode inside the synchronized block (as is already done with its children), then call writeRecord with the copy outside of the original DataNode's synchronized block.
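The copy-then-write idea can be sketched as follows. This is a minimal illustration, not the attached patch: Node is a hypothetical stand-in for DataNode, serializeNode here is a simplified method written for this example, and LockProbeStream exists only to show that the node's monitor is not held while the (potentially blocking) write runs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CopyThenWrite {
    /** Hypothetical stand-in for a tree node; not ZooKeeper's DataNode. */
    static final class Node {
        byte[] data;
        long version;

        Node(byte[] data, long version) {
            this.data = data;
            this.version = version;
        }
    }

    /** Copies the node's state under its lock, then writes the copy with the lock released. */
    static void serializeNode(Node node, OutputStream out) throws IOException {
        byte[] dataCopy;
        long versionCopy;
        synchronized (node) {
            // Only cheap in-memory copying happens while the monitor is held.
            dataCopy = node.data == null ? new byte[0] : Arrays.copyOf(node.data, node.data.length);
            versionCopy = node.version;
        }
        // The write may block for a long time, but other threads can still lock the node.
        out.write(Long.toString(versionCopy).getBytes(StandardCharsets.UTF_8));
        out.write(':');
        out.write(dataCopy);
    }

    /** Records whether the writing thread holds the given monitor during any write. */
    static final class LockProbeStream extends ByteArrayOutputStream {
        private final Object monitor;
        boolean lockHeldDuringWrite = false;

        LockProbeStream(Object monitor) {
            this.monitor = monitor;
        }

        @Override
        public synchronized void write(int b) {
            lockHeldDuringWrite |= Thread.holdsLock(monitor);
            super.write(b);
        }

        @Override
        public synchronized void write(byte[] b, int off, int len) {
            lockHeldDuringWrite |= Thread.holdsLock(monitor);
            super.write(b, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        Node node = new Node("payload".getBytes(StandardCharsets.UTF_8), 3);
        LockProbeStream out = new LockProbeStream(node);
        serializeNode(node, out);
        System.out.println("lock held during write: " + out.lockHeldDuringWrite);
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The trade-off is one extra copy of the node's contents per serialization, in exchange for keeping slow I/O out of the critical section so that request processing on the same node is not blocked by a stalled connection.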

Attachments

- ZOOKEEPER-2201.patch, 02/Jun/15 01:32, 5 kB, Donny Nadolny
- ZOOKEEPER-2201.patch, 03/Jun/15 14:17, 5 kB, Donny Nadolny
- ZOOKEEPER-2201-branch-34.patch, 03/Jun/15 14:19, 5 kB, Donny Nadolny
- ZOOKEEPER-2201.patch, 03/Jun/15 14:37, 5 kB, Donny Nadolny
- ZOOKEEPER-2201.patch, 03/Jun/15 16:07, 5 kB, Donny Nadolny
- ZOOKEEPER-2201.patch, 04/Jun/15 21:38, 5 kB, Donny Nadolny

People

Assignee: Donny Nadolny

Reporter: Donny Nadolny

Votes: 0

Watchers: 8

Dates

Created: 02/Jun/15 01:30

Updated: 21/Jul/16 20:18

Resolved: 06/Jun/15 16:54