Details
-
Bug
-
Status: Open
-
Critical
-
Resolution: Unresolved
-
3.6.0
-
None
-
None
Description
This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues have been implemented in CommitProcessor to improve performance. abrahamfine 's analysis is accurate on GitHub: https://github.com/apache/zookeeper/pull/300
He was trying to introduce synchronization between Learner.syncWithLeader() and CommitProcessor to wait for in-flight requests to be committed before accepting client requests.
In the affected unit test (testNodeDataChanged) there's a race between reconnecting client's setWatches request and updates coming from the leader according to the following logs:
2018-10-25 13:59:58,556 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a 2018-10-25 13:59:58,556 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a ... 2018-10-25 13:59:58,557 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2 reqpath:n/a ... 2018-10-25 13:59:58,561 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1 reqpath:n/a 2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for sessionid 0x10005d8fc4d0000
setWatches request is processed before delete and create, hence the client receives NodeDeleted event.
In the working scenario it looks like:
2018-10-25 14:04:55,247 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005dd88110000 type:delete cxid: 0x1 zxid:0x100000004 txntype:2 reqpath:n/a 2018-10-25 14:04:55,249 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005dd88110000 type:create cxid: 0x2 zxid:0x100000005 txntype:1 reqpath:n/a ... 2018-10-25 14:04:56,314 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a 2018-10-25 14:04:56,315 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a ... 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got notification sessionid:0x10005dd88110000 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for sessionid 0x10005dd88110000
delete and create requests happen way before setWatches comes in (even before the client connection is established) and client receives NodeDataChanged event only.
Abe's approach unfortunately raises the following concerns:
- modifies CommitProcessor's code which might affect performance and correctness (shralex raised on
ZOOKEEPER-2807), - we experienced deadlocks while testing the patch: https://github.com/apache/zookeeper/pull/300
As a consequence I raised this Jira to capture the experiences and to put the unit test on Ignore list, because currently I'm not sure about whether this is a real issue or a non-backward compatible change in 3.6 with the gain of a huge performance improvement.
Either way I don't want this flaky test to influence contributions, so I'll mark as Ignored on trunk until the issue is resolved.
Attachments
Issue Links
- causes
-
ZOOKEEPER-2807 Flaky test: org.apache.zookeeper.test.WatchEventWhenAutoResetTest.testNodeDataChanged
- Resolved
- is broken by
-
ZOOKEEPER-2024 Major throughput improvement with mixed workloads
- Resolved