Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-3182

Race condition when follower syncing with leader and starting to serve requests

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 3.6.0
    • None
    • server
    • None

    Description

      This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues have been implemented in CommitProcessor to improve performance. abrahamfine 's analysis is accurate on GitHub: https://github.com/apache/zookeeper/pull/300

      He was trying to introduce synchronization between Learner.syncWithLeader() and CommitProcessor to wait for in-flight requests to be committed before accepting client requests.

      In the affected unit test (testNodeDataChanged) there's a race between reconnecting client's setWatches request and updates coming from the leader according to the following logs:

      2018-10-25 13:59:58,556 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
      2018-10-25 13:59:58,556 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
      ...
      2018-10-25 13:59:58,557 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2 reqpath:n/a
      ...
      2018-10-25 13:59:58,561 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1 reqpath:n/a
      2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for sessionid 0x10005d8fc4d0000
      

      setWatches request is processed before delete and create, hence the client receives NodeDeleted event.

      In the working scenario it looks like:

      2018-10-25 14:04:55,247 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005dd88110000 type:delete cxid:
      0x1 zxid:0x100000004 txntype:2 reqpath:n/a
      2018-10-25 14:04:55,249 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005dd88110000 type:create cxid:
      0x2 zxid:0x100000005 txntype:1 reqpath:n/a
      ...
      2018-10-25 14:04:56,314 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
      2018-10-25 14:04:56,315 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
      ...
      2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got notification sessionid:0x10005dd88110000
      2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for sessionid 0x10005dd88110000
      

      delete and create requests happen way before setWatches comes in (even before the client connection is established) and client receives NodeDataChanged event only.

      Abe's approach unfortunately raises the following concerns:

      As a consequence I raised this Jira to capture the experiences and to put the unit test on Ignore list, because currently I'm not sure about whether this is a real issue or a non-backward compatible change in 3.6 with the gain of a huge performance improvement.

      Either way I don't want this flaky test to influence contributions, so I'll mark as Ignored on trunk until the issue is resolved.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              andor Andor Molnar
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated: