ZooKeeper / ZOOKEEPER-2845

Data inconsistency issue due to retain database in leader election

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.10, 3.5.3, 3.6.0
    • Fix Version/s: None
    • Component/s: quorum
    • Labels:
      None

      Description

      In ZOOKEEPER-2678, the ZKDatabase is retained across leader elections to reduce the unavailable time. In a ZooKeeper ensemble, it is possible for the snapshot to be ahead of the txn log (due to a slow disk on the server, etc.), or for the txn log to be ahead of the snapshot because no commit message has been received yet.

      If the snapshot is ahead of the txn log, the SyncRequestProcessor queue is drained during shutdown, so the snapshot and txn log become consistent again before leader election happens; this is not an issue.

      But if the txn log is ahead of the snapshot, the ensemble can end up with inconsistent data. Here is a simplified scenario that shows the issue:

      Let's say we have 3 servers in the ensemble: A and B are followers, C is the leader, and the snapshots and txn logs on all of them are up to T0:
      1. A new request to create node N reaches leader C and is converted to txn T1.
      2. Txn T1 is synced to disk on C, but A and B restart just before the proposal reaches them, so T1 does not exist on A or B.
      3. A and B form a new quorum after the restart; let's say B becomes the leader.
      4. C changes to the LOOKING state because it no longer has enough followers, then syncs with leader B reporting last zxid T0, which results in an empty DIFF sync.
      5. C restarts before taking a snapshot and replays the txns on disk, which include T1; now C has node N, but A and B do not (see the sketch below).
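      To make the failure mode in step 5 concrete, the following is a minimal, hedged sketch (not actual ZooKeeper code) of the two "tails" that diverge on C. The names lastZxidInTxnLog and lastZxidInRetainedDb are hypothetical stand-ins for the newest zxid fsynced to the txn log and the lastProcessedZxid of the retained in-memory database.

      // Hedged illustration only; zxid values are made up (T0 = 0x100000001, T1 = 0x100000002).
      public class RetainedDbDivergenceSketch {
          public static void main(String[] args) {
              long lastZxidInTxnLog = 0x100000002L;     // T1: C fsynced the proposal before losing the quorum
              long lastZxidInRetainedDb = 0x100000001L; // T0: the retained database never applied T1

              // During sync C reports T0, so the new leader sends an empty DIFF; but a later
              // restart replays the txn log up to T1, resurrecting a txn the new quorum never saw.
              if (lastZxidInTxnLog > lastZxidInRetainedDb) {
                  System.out.println("txn log tail 0x" + Long.toHexString(lastZxidInTxnLog)
                          + " is ahead of retained database tail 0x"
                          + Long.toHexString(lastZxidInRetainedDb));
              }
          }
      }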

      I have also included a test case that reproduces this issue consistently.

      We have a substantially different RetainDB version that avoids this issue by reconciling the snapshot and txn files before leader election; we will submit it for review.
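      The RetainDB patch itself is not attached here, so the following is only a rough, hedged sketch of what "reconciling the snapshot and txn files before leader election" could look like; it is not the authors' implementation. It assumes ZKDatabase's getDataTreeLastProcessedZxid(), clear(), and loadDataBase() methods, and leaves the on-disk txn-log tail as a caller-supplied parameter rather than assuming a specific accessor.

      import java.io.IOException;

      import org.apache.zookeeper.server.ZKDatabase;

      public final class RetainedDbReconcileSketch {
          private RetainedDbReconcileSketch() {}

          /**
           * Hedged sketch, not the actual RetainDB patch: before entering leader
           * election, make sure the retained in-memory database covers everything
           * that is durable in the txn log, so the peer reports the same last zxid
           * it would report after a full restart.
           *
           * @param zkDb                 the peer's retained ZKDatabase
           * @param lastLoggedZxidOnDisk newest zxid in the on-disk txn log, obtained
           *                             by the caller from the txn log files
           */
          public static long reconcileBeforeElection(ZKDatabase zkDb, long lastLoggedZxidOnDisk)
                  throws IOException {
              long retained = zkDb.getDataTreeLastProcessedZxid();
              if (lastLoggedZxidOnDisk > retained) {
                  // The txn log holds entries the retained tree never applied (the bad
                  // case described above): fall back to a full reload so lastProcessedZxid
                  // reflects everything on disk before the peer votes and syncs.
                  zkDb.clear();
                  return zkDb.loadDataBase(); // replays snapshot + txn log from disk
              }
              return retained; // retained database already covers the txn log; keep it
          }
      }

      In a sketch like this the full reload only happens in the rare divergent case, so the availability benefit of ZOOKEEPER-2678 would be kept on the common path.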

        Issue Links

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user lvfangmin opened a pull request:

          https://github.com/apache/zookeeper/pull/310

          ZOOKEEPER-2845 [Test] Test used to reproduce the data inconsistency issue due to retain database in leader election

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-2845-TEST

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/zookeeper/pull/310.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #310


          commit ff0bc49de51635da1d5bff0e4f260a61acc87db0
          Author: Fangmin Lyu <allenlyu@fb.com>
          Date: 2017-07-14T23:02:20Z

          reproduce the data inconsistency issue


          hadoopqa Hadoop QA added a comment -

          -1 overall. GitHub Pull Request Build

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//console

          This message is automatically generated.

          githubbot ASF GitHub Bot added a comment -

          Github user lvfangmin commented on a diff in the pull request:

          https://github.com/apache/zookeeper/pull/310#discussion_r127567852

          — Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java —
          @@ -784,4 +784,126 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                           maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
               }

          +    @Test
          +    public void testTxnAheadSnapInRetainDB() throws Exception {
          +        // 1. start up server and wait for leader election to finish
          +        ClientBase.setupTestEnv();
          +        final int SERVER_COUNT = 3;
          +        final int clientPorts[] = new int[SERVER_COUNT];
          +        StringBuilder sb = new StringBuilder();
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            clientPorts[i] = PortAssignment.unique();
          +            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":"
          +                    + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
          +        }
          +        String quorumCfgSection = sb.toString();
          +
          +        MainThread mt[] = new MainThread[SERVER_COUNT];
          +        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
          +            mt[i].start();
          +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
          +        }
          +
          +        waitForAll(zk, States.CONNECTED);
          +
          +        // we need to shutdown and start back up to make sure that the create session isn't the first transaction since
          +        // that is rather innocuous.
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            mt[i].shutdown();
          +        }
          +
          +        waitForAll(zk, States.CONNECTING);
          +
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            mt[i].start();
          +            // Recreate a client session since the previous session was not persisted.
          +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
          +        }
          +
          +        waitForAll(zk, States.CONNECTED);
          +
          +        // 2. kill all followers
          +        int leader = -1;
          +        Map<Long, Proposal> outstanding = null;
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (mt[i].main.quorumPeer.leader != null) {
          +                leader = i;
          +                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
          +                // increase the tick time to delay the leader going to looking
          +                mt[leader].main.quorumPeer.tickTime = 10000;
          +            }
          +        }
          +
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (i != leader) {
          +                mt[i].shutdown();
          +            }
          +        }
          +
          +        // 3. start up the followers to form a new quorum
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (i != leader) {
          +                mt[i].start();
          +            }
          +        }
          +
          +        // 4. wait one of the follower to be the leader
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (i != leader) {
          +                // Recreate a client session since the previous session was not persisted.
          +                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
          +                waitForOne(zk[i], States.CONNECTED);
          +            }
          +        }
          +
          +        // 5. send a create request to leader and make sure it's synced to disk,
          +        // which means it acked from itself
          +        try {
          +            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
          +                    CreateMode.PERSISTENT);
          +            Assert.fail("create /zk" + leader + " should have failed");
          +        } catch (KeeperException e) {}
          +
          +        // just make sure that we actually did get it in process at the
          +        // leader
          +        Assert.assertTrue(outstanding.size() == 1);
          +        Proposal p = (Proposal) outstanding.values().iterator().next();
          +        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
          +
          +        // make sure it has a chance to write it to disk
          +        Thread.sleep(1000);
          +        p.qvAcksetPairs.get(0).getAckset().contains(leader);
          +
          +        // 6. wait the leader to quit due to no enough followers
          +        waitForOne(zk[leader], States.CONNECTING);
          +
          +        int newLeader = -1;
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (mt[i].main.quorumPeer.leader != null) {
          +                newLeader = i;
          +            }
          +        }
          +        // make sure a different leader was elected
          +        Assert.assertTrue(newLeader != leader);
          +
          +        // 7. restart the previous leader
          +        mt[leader].shutdown();
          +        waitForOne(zk[leader], States.CONNECTING);
          +        mt[leader].start();
          +        waitForOne(zk[leader], States.CONNECTED);
          +
          +        // 8. check the node exist in previous leader but not others
          +        // make sure everything is consistent
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            Assert.assertTrue("server " + i + " should not have /zk" + leader,
          +                    zk[i].exists("/zk" + leader, false) == null);
          — End diff –

          The test will fail here because the node exists on the previous leader.

          hanm Michael Han added a comment -

          Thanks for reporting this issue Fangmin Lv.

          C changes to the LOOKING state because it no longer has enough followers, then syncs with leader B reporting last zxid T0, which results in an empty DIFF sync.

          Are you saying leader B is sending a DIFF to follower C in this case? Since B does not have T1, I think it should send a TRUNC and C should drop T1 in its txn log.

          lvfangmin Fangmin Lv added a comment -

          Michael Han, T1 only exists in the txn log and hasn't been applied to the data tree yet, so the lastProcessedZxid on follower C is T0 and no TRUNC message is sent when it syncs with the leader.
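          For context, here is a simplified, hedged sketch of how a leader chooses the sync mode from the follower's reported last zxid; the real LearnerHandler logic handles more edge cases. Because C reports T0, which matches the new leader's last committed zxid, the outcome is an empty DIFF rather than a TRUNC.

          // Simplified, hedged sketch of the leader's sync-mode choice; not the actual
          // LearnerHandler code.
          public class SyncModeSketch {
              enum SyncMode { DIFF, TRUNC, SNAP }

              static SyncMode chooseSyncMode(long peerLastZxid, long minCommittedLog, long maxCommittedLog) {
                  if (peerLastZxid == maxCommittedLog) {
                      // Follower claims it is already caught up: empty DIFF.
                      // This is C's case when it reports T0 to leader B.
                      return SyncMode.DIFF;
                  } else if (peerLastZxid > maxCommittedLog) {
                      // Follower is ahead of the leader's committed log: tell it to truncate.
                      return SyncMode.TRUNC;
                  } else if (peerLastZxid >= minCommittedLog) {
                      // Send the committed proposals in (peerLastZxid, maxCommittedLog].
                      return SyncMode.DIFF;
                  } else {
                      // Too far behind the committed log: send a full snapshot.
                      return SyncMode.SNAP;
                  }
              }
          }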

          hanm Michael Han added a comment - edited

          Makes sense to me. I think we didn't have this issue previously because the zkDb was cleared across leader elections: C would find that its lastProcessedZxid is T1, rather than T0, while reinitializing the zkDb from the snap/txn log, which would yield a TRUNC instead of a DIFF from leader B.

          hanm Michael Han added a comment -

          Fangmin Lv, any plan to submit your RetainDB implementation? This is an important bug to fix.

          lvfangmin Fangmin Lv added a comment -

          Michael Han, we've just finished RetainDB and are testing it in our internal ensemble; we might submit the code for review next week.

          lvfangmin Fangmin Lv added a comment -

          The internal patch has been stabilized and tested for a long time; we rolled it out to one of our production environments last week. Joseph from our team will attach the patch here for review this week.

          hanm Michael Han added a comment -

          Thanks for the update, Fangmin Lv. Good to know the patch has been tested in a production environment!

          githubbot ASF GitHub Bot added a comment -

          Github user revans2 commented on the issue:

          https://github.com/apache/zookeeper/pull/310

          @lvfangmin any update on getting a pull request for the actual fix?

          githubbot ASF GitHub Bot added a comment -

          Github user lvfangmin commented on the issue:

          https://github.com/apache/zookeeper/pull/310

          @revans2 my teammate has been working on the fix, and he was planning to run it in prod for a while before sending out the diff. I'll sync with him today about the status.


            People

            • Assignee:
              lvfangmin Fangmin Lv
              Reporter:
              lvfangmin Fangmin Lv
            • Votes:
              0
              Watchers:
              6

              Dates

              • Created:
                Updated:

                Development