  ZooKeeper / ZOOKEEPER-1549

Data inconsistency when follower is receiving a DIFF with a dirty snapshot

    Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 3.4.3
    • Fix Version/s: 3.5.4, 3.6.0
    • Component/s: quorum
    • Labels:
      None

      Description

      The trunc code (from ZOOKEEPER-1154?) cannot work correctly if the snapshot is not correct.
      Here is a scenario (similar to 1154):
      Initial Condition
  1. Let's say there are three nodes in the ensemble, A, B, C, with A being the leader.
  2. The current epoch is 7.
  3. For simplicity of the example, let's say a zxid is a two-digit number, with the epoch being the first digit.
      4. The zxid is 73
      5. All the nodes have seen the change 73 and have persistently logged it.
      Step 1
      A request with zxid 74 is issued. The leader A writes it to the log, but the entire ensemble crashes and B, C never write change 74 to their logs.
      Step 2
      A and B restart, A is elected as the new leader, and A loads its data and takes a clean snapshot (change 74 is in it), then sends a diff to B, but B dies before syncing with A. A dies later.
      Step 3
      B,C restart, A is still down
      B,C form the quorum
      B is the new leader. Let's say B's minCommitLog is 71 and maxCommitLog is 73
      epoch is now 8, zxid is 80
      Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
      Step 4
      A starts up. It applies the change in request with zxid 74 to its in-memory data tree
      A contacts B to registerAsFollower and provides 74 as its ZxId
      Since 71<=74<=81, B decides to send A the diff.
      Problem:
      The problem with the above sequence is that after truncating the log, A will load the snapshot again, which is not correct.

      In the 3.3 branch, FileTxnSnapLog.restore does not call the listener (ZOOKEEPER-874), so the leader will send a snapshot to the follower and this will not be a problem.
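
      For context, here is a minimal sketch of the leader-side sync decision described above (DIFF vs. TRUNC vs. SNAP). It is illustrative only and uses simplified names, not the actual LearnerHandler source:

      public class SyncDecisionSketch {
          enum Sync { DIFF, TRUNC, SNAP }

          // peerLastZxid: last zxid the follower reports; min/maxCommittedLog: bounds of the leader's committedLog
          static Sync decide(long peerLastZxid, long minCommittedLog, long maxCommittedLog) {
              if (peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog) {
                  return Sync.DIFF;   // replay committed txns after peerLastZxid
              } else if (peerLastZxid > maxCommittedLog) {
                  return Sync.TRUNC;  // ask the follower to truncate back to maxCommittedLog
              } else {
                  return Sync.SNAP;   // follower is too far behind; ship a full snapshot
              }
          }

          public static void main(String[] args) {
              // The scenario above: A reports 74, B's committedLog spans 71..81,
              // so B chooses DIFF even though 74 was never committed by a quorum.
              System.out.println(decide(74, 71, 81)); // DIFF
          }
      }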

      1. case.patch
        6 kB
        Jacky007
      2. ZOOKEEPER-1549-3.4.patch
        134 kB
        Thawan Kooburat
      3. ZOOKEEPER-1549-learner.patch
        8 kB
        Flavio Junqueira

        Issue Links

          Activity

          fanster.z Jacky007 added a comment -

          I've written a test case to reproduce it. I think we could delete snapshots larger than the truncate zxid to solve the problem.

          fpj Flavio Junqueira added a comment -

          Thanks for reporting this issue. I'm not sure your assessment is correct, though. If the problem you're mentioning really exists, then it seems wrong to me that A takes a snapshot including 74 without getting an acknowledgment from B. A needs to make sure that B has 74 before making the state final, and A is essentially committing 74 (without a quorum) by generating a snapshot that contains it.

          Deleting a snapshot sounds wrong to me, and I'd rather not do it even if it turns out to be a valid solution.

          fanster.z Jacky007 added a comment -

          Thanks to Flavio for commenting. I agree with Flavio that deleting a snapshot is too dangerous to be an option. There should be another solution.
          Step 1
          74 is in A's transaction logs.
          Step 2
          A is the new leader, and it will execute the following code.

          void lead() throws IOException, InterruptedException {
                  self.end_fle = System.currentTimeMillis();
                  LOG.info("LEADING - LEADER ELECTION TOOK - " +
                        (self.end_fle - self.start_fle));
                  self.start_fle = 0;
                  self.end_fle = 0;
          
                  zk.registerJMX(new LeaderBean(this, zk), self.jmxLocalPeerBean);
          
                  try {
                      self.tick = 0;
                      zk.loadData();
          

          Then A will load its snapshot and committedLog.

              public void loadData() throws IOException, InterruptedException {
                  setZxid(zkDb.loadDataBase());
                  // Clean up dead sessions
                  LinkedList<Long> deadSessions = new LinkedList<Long>();
                  for (Long session : zkDb.getSessions()) {
                      if (zkDb.getSessionWithTimeOuts().get(session) == null) {
                          deadSessions.add(session);
                      }
                  }
                  zkDb.setDataTreeInit(true);
                  for (long session : deadSessions) {
                      // XXX: Is lastProcessedZxid really the best thing to use?
                      killSession(session, zkDb.getDataTreeLastProcessedZxid());
                  }
          
                  // Make a clean snapshot
                  takeSnapshot();
              }
          

          When A calls takeSnapshot(), 74 is in it (if A dies after that, B will never know about it). When A loads the database,

              public void loadData() throws IOException, InterruptedException {
                  setZxid(zkDb.loadDataBase());
          

          it will restore the database from snapshots and transaction logs,

          long zxid = snapLog.restore(dataTree,sessionsWithTimeouts,listener);
          
                      try {
                          processTransaction(hdr,dt,sessions, itr.getTxn());
                      } catch(KeeperException.NoNodeException e) {
                         throw new IOException("Failed to process transaction type: " +
                               hdr.getType() + " error: " + e.getMessage(), e);
                      }
                      listener.onTxnLoaded(hdr, itr.getTxn());
          

          but 74 is in A's transaction logs.

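          To summarize why TRUNC alone cannot undo this: restore rebuilds the data tree from the newest snapshot first and only replays the log suffix after it. Below is a minimal sketch of that ordering, with illustrative stub types rather than the real FileTxnSnapLog API:

          import java.util.List;

          class RestoreOrderSketch {
              interface DataTreeStub { void apply(long zxid); }
              interface SnapshotStub { long deserializeInto(DataTreeStub tree); }
              interface TxnLogStub   { List<Long> zxidsAfter(long zxid); }

              static long restore(DataTreeStub tree, SnapshotStub newestSnapshot, TxnLogStub txnLog) {
                  // Step 1: the snapshot is applied wholesale; if 74 was baked into it, it is back.
                  long snapZxid = newestSnapshot.deserializeInto(tree);
                  // Step 2: only txns after the snapshot zxid are replayed from the log,
                  // so truncating the log cannot remove what step 1 already restored.
                  long highest = snapZxid;
                  for (long zxid : txnLog.zxidsAfter(snapZxid)) {
                      tree.apply(zxid);
                      highest = zxid;
                  }
                  return highest;
              }
          }
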
          fanster.z Jacky007 added a comment -

          a testcase

          Show
          fanster.z Jacky007 added a comment - a testcase
          Hide
          fanster.z Jacky007 added a comment -

          Flavio Junqueira, any suggestions on this?

          fpj Flavio Junqueira added a comment -

          Hi Jacky007, I'm sorry, I missed your comment. I'll have a look at it shortly.

          fpj Flavio Junqueira added a comment -

          I'm not done with my analysis yet, but my understanding is that B (leader) should be sending a TRUNC message to A (prospective follower), since this predicate (prevProposalZxid < peerLastZxid) in LearnerHandler holds for the scenario described. Process A however should fail because it can't truncate.

          Taking a snapshot this early in the lead() method does not sound like the right thing to do, since the leader does not have quorum support yet.

          fanster.z Jacky007 added a comment -

          Proposals which have quorum support are written to the memory database normally. When a server starts, all proposals are written to memory (this may be the root of the problem); correctness is based on the hypothesis that proposals without quorum support will be truncated before the server provides service.

          If we can not guarantee that all data in memory has quorum support in the start-up procedure, we should guarantee that taking a dirty snapshot will never happen.

          Which means taking a snapshot when the prospective follower receives an UPTODATE message is also wrong.

          When the prospective leader has a quorum of ACKs for UPTODATE, it can take a snapshot. But the prospective follower will never know such things in the start-up procedure.

          So the prospective follower cannot take a snapshot here.

                          case Leader.UPTODATE:
                              zk.takeSnapshot();
          

          Is it really indispensable here?

          What is the motivation of taking a snapshot in the lead() method?

          fpj Flavio Junqueira added a comment -

          If we can not guarantee that all data in memory has quorum support in the start-up procedure, we should guarantee that taking a dirty snapshot will never happen.

          Agreed, that's the point I tried to make before. Taking a snapshot over state that includes uncommitted transactions does not sound right.

          Which means taking a snapshot when the prospective follower receives an UPTODATE message is also wrong.

          A follower can only take a snapshot once it learns that it has all been committed.

          What is the motivation of taking a snapshot in the lead() method?

          I don't think it is there for correctness. Frequent snapshots are good to reduce the time to recover and it sounds reasonable to take a snapshot when starting a new epoch.

          fpj Flavio Junqueira added a comment - - edited

          I was thinking that it would be ok to remove that takeSnapshot() call from loadData(). It is incorrect as we have been discussing, and it doesn't seem to break anything, although a couple of tests failed when I gave it a try. I haven't looked at the reasons for the tests to fail; I'll do it next.

          Jacky007 Which version of ZooKeeper are you using? The case I see in trunk looks like this:

          case Leader.UPTODATE:
                              if (!snapshotTaken) { // true for the pre v1.0 case
                                  zk.takeSnapshot();
                                  self.setCurrentEpoch(newEpoch);
                              }
          
          fanster.z Jacky007 added a comment -

          Sorry, the code is from the 3.3.x branch. Could you tell me which jira is for the above change?

          In 3.4.3, taking a snapshot when the prospective follower receives a NEWLEADER message does not sound right.
          I think we should do that when the prospective follower receives an UPTODATE message, since the data has quorum support by then.

          fpj Flavio Junqueira added a comment -

          Could you tell me which jira is for the above change?

          I believe it was in the patch for this jira ZOOKEEPER-1282.

          I think we should do that when the prospective follower receives an UPTODATE message, since the data has quorum support by then.

          I'm not sure what you're proposing here. My proposal is still that we remove the call to takeSnapshot in loadData. Do you see a problem with doing it?

          fanster.z Jacky007 added a comment -

          Hi Flavio Junqueira, I'm sorry, I had a long holiday and didn't reply. https://issues.apache.org/jira/browse/ZOOKEEPER-1558 looks good to me.
          But https://issues.apache.org/jira/browse/ZOOKEEPER-1559 seems rather complicated. If we move the zk.takeSnapshot() call to UPTODATE, it will break Zab 1.0 Phase 2.4.

          The learner should not snapshot uncommitted state before it has quorum support, but if the learner does not snapshot, the uncommitted state will never get quorum support since it has no persistent storage. If the learner does snapshot, then it needs some feature such as deleting unwanted snapshots, and we are back where we started.

          Any ideas?

          thawan Thawan Kooburat added a comment -

          If we let the leader know which zxid is considered committed by the majority of the quorum (we have this information as part of the leader election process), then we just need to truncate the leader's txn log to that zxid before starting it.

          In this case, we don't have to worry that we might have uncommitted txns in the snapshot on either the leader or the follower.

          fanster.z Jacky007 added a comment -

          Then, we just need to truncate the leader's txn log to that zxid before starting it.

          We cannot do that, for consistency reasons.

          fpj Flavio Junqueira added a comment -

          Then, we just need to truncate the leader's txn log to that zxid before starting it.

          Also, we can't truncate a snapshot. We only truncate a suffix of the transaction log.

          thawan Thawan Kooburat added a comment -

          Can you explain a bit more about the consistency issue? If we take snapshots correctly, uncommitted txns should never be in the snapshot when a participant is elected as a leader, so we can truncate uncommitted txns from its txnlog.

          fanster.z Jacky007 added a comment -

          A, B, C, D, E: A proposes v, B and C accept it, then A and B die. C is elected as the new leader. We cannot discard the proposal v.

          thawan Thawan Kooburat added a comment -

          Thanks for the explanation. Since we cannot differentiate between committed and uncommitted txns, we have to treat all of them as committed. However, because we don't do quorum voting on these txns, we run into the problem of snapshotting uncommitted state.

          Does this suggest that what we really need is another round of quorum voting on these grey txns? If we don't do that, then we have to make the current follower sync-up behave like voting.

          During normal operation, a follower votes by writing to its txnlog. If the quorum fails, we can truncate the log to abort that txn. However, during sync-up via snapshot, we cannot abort the txn unless we remove the snapshot.

          fpj Flavio Junqueira added a comment - - edited

          I don't think major changes are needed, at least for the leader case. We simply shouldn't be taking snapshots over uncommitted state. Check ZOOKEEPER-1558 and ZOOKEEPER-1559, subtasks of this jira.

          thawan Thawan Kooburat added a comment -

          This may require more changes, but it is based on these two ideas:

          1. We should never let uncommitted changes reach the DataTree, so we don't have to worry about a snapshot containing uncommitted state.
          2. The follower needs to log uncommitted txns sent by the leader to its txnlog during synchronization. This mimics quorum voting and guarantees correctness.

          Here is how we may implement it:

          Before the leader starts serving requests, there is no need for the leader to replay its txnlog at all. It only needs to know the latest zxid in order to be elected as a leader and synchronize with the followers. The leader needs to send the uncommitted txns before sending the NEWLEADER packet. Even if a follower requests a snapshot, it won't get uncommitted txns as part of the snapshot. Then, the follower needs to log the stream of txns between DIFF/SNAP and NEWLEADER to disk before sending an ACK back to the leader. When the leader receives a majority of ACKs, it can apply the uncommitted txns and start serving requests.

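          A minimal sketch of the leader-side ordering in this proposal, using illustrative names (PendingTxn, LeaderStartupSketch, and so on are not ZooKeeper classes), assuming the txns read from the txnlog stay out of the DataTree until a quorum has ACKed NEWLEADER:

          import java.util.ArrayList;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Set;

          class LeaderStartupSketch {
              record PendingTxn(long zxid) {}

              private final List<PendingTxn> uncommitted = new ArrayList<>(); // read from txnlog, not yet applied
              private final Set<Long> newLeaderAcks = new HashSet<>();
              private final int quorumSize;

              LeaderStartupSketch(int ensembleSize, List<PendingTxn> txnsFromLog) {
                  this.quorumSize = ensembleSize / 2 + 1;
                  this.uncommitted.addAll(txnsFromLog);
              }

              // Sent to each learner before NEWLEADER, so no snapshot ever contains these txns.
              List<PendingTxn> txnsToSync() {
                  return List.copyOf(uncommitted);
              }

              // Called when a learner has logged the synced txns and ACKed NEWLEADER.
              boolean onNewLeaderAck(long learnerId) {
                  newLeaderAcks.add(learnerId);
                  if (newLeaderAcks.size() >= quorumSize) {
                      applyToDataTreeAndCommittedLog(uncommitted); // only now does the state become visible
                      return true; // safe to start serving requests
                  }
                  return false;
              }

              private void applyToDataTreeAndCommittedLog(List<PendingTxn> txns) { /* apply in zxid order */ }
          }
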
          fpj Flavio Junqueira added a comment -

          Thawan Kooburat If I understand your proposal, it sounds like it works. All we need to guarantee is that a quorum has persisted and acknowledged each transaction the leader has reflected in its initial state. But, I'm not convinced that it is worth changing the way we currently do it. It achieves the same goal and the one thing we need to fix is that we can't snapshot state that contains uncommitted transactions. One particular issue with changing the requests to snapshot during recovery is that we have some code involving snapshots that is there for backward compatibility (see ZOOKEEPER-1559). I haven't been able to look closely at ZOOKEEPER-1559, though.

          thawan Thawan Kooburat added a comment -

          I agree that your current solution works on the leader side, but I am not sure how you plan to fix the problem on the learner side.

          Even if we only need to consider the Zab 1.0 protocol, the learner needs to commit its state (and history) when it receives the NEWLEADER packet and sends an ACK back to the leader. However, we cannot take the snapshot now since it may contain uncommitted state. Are we going to take the snapshot after getting UPTODATE from the leader instead?

          fpj Flavio Junqueira added a comment -

          Thawan Kooburat I'm sorry for not getting back to this before. I've been investigating this issue on and off, though. After thinking more carefully about the problem, your rough plan seems good to me. The key point there is that no snapshot should contain uncommitted state, but we can't actually avoid having followers save a snapshot to disk because they need to persist the txns they have accepted. We also can't discard snapshots, since we could be implicitly discarding transactions that the follower has accepted.

          To review how I believe we need to do this, there are now three possibilities for what a follower can get (ignoring requests to truncate):

          1. A diff of txns;
          2. A snapshot of the leader state;
          3. A snapshot + a diff.

          The first two are part of the protocol today, but the third is not. One way is to create a new message. Another way, which seems good to me right now, is to collapse 2 and 3. On the follower side, we can save any snapshot it receives right away, since we are assuming that any snapshot contains only committed state. If there are more transactions, then it receives and logs them. It is ok to apply them to the data tree if we guarantee that we won't have snapshots until we receive the UPTODATE message (an acknowledgement that the leader has enough support).

          Let me know your thoughts, please.

          fanster.z Jacky007 added a comment -

          The problem is that when the follower receives a snapshot from the leader, it must ACK to the leader after taking a snapshot (there is no txnlog).

          fanster.z Jacky007 added a comment -

          I think what Flavio says is the same as what Thawan Kooburat says.

          Before the leader starts serving requests, there is no need for the leader to replay its txnlog at all. It only needs to know the latest zxid in order to be elected as a leader and synchronize with the followers. The leader needs to send the uncommitted txns before sending the NEWLEADER packet. Even if a follower requests a snapshot, it won't get uncommitted txns as part of the snapshot. Then, the follower needs to log the stream of txns between DIFF/SNAP and NEWLEADER to disk before sending an ACK back to the leader. When the leader receives a majority of ACKs, it can apply the uncommitted txns and start serving requests.

          The problem is that the leader can only send an old snapshot to the follower (related to snapShot), or we should save lastProcessedZxid.

          fpj Flavio Junqueira added a comment -

          I think what Flavio says is the same as what Thawan Kooburat says.

          I had already agreed that Thawan Kooburat's proposal was good, and I just added more detail about the changes to the protocol. One important change is that a follower may receive a combination of a snapshot and transactions. Right now in trunk, it is either a snapshot or a diff.

          The problem is that the leader can only send an old snapshot to the follower (related to snapShot), or we should save lastProcessedZxid.

          The leader sends the latest snapshot it has and a suffix, potentially non-empty, of uncommitted transactions. It doesn't matter if the snapshot is old as long as it only contains committed state. I don't see a reason for saving lastProcessedZxid.

          phunt Patrick Hunt added a comment -

          I'm not sure if it's related but I've gotten two reports now from 3.4.4/3.4.5 users that have seen corrupted clusters.

          In one case the user had over 3000 leader elections (virtualized environment), ended up with an ephemeral that was owned by a "gone" session on one of the 3 servers. The other 2 servers had the znode but it was owned by an active session.

          In the other case there were around 50 leader elections, ended up with 4 of 5 servers with /foo and the fifth didn't have /foo at all.

          It would be great if we could prioritize this and put out a 3.4.6 release asap.

          thawan Thawan Kooburat added a comment -

          I don't think there is a need to create a new message to mark a new type of synchronization (SNAP+DIFF). If the follower sees any proposal between SNAP and NEWLEADER, it just needs to log the proposals and only apply them after it sees UPTODATE, or the leader can explicitly send a COMMIT for each of the proposals after it gets NEWLEADER_ACK from the follower. Similar logic already exists for any txns that come between SNAP and UPTODATE, so it won't be much more work.

          Here are the 2 main reasons why I prefer this approach.
          1. The follower should not take a snapshot after UPTODATE, since the leader has already switched to the syncLimit timeout (+ tick counting) after UPTODATE is sent. In your proposal, the leader may tear down the quorum if syncLimit is not long enough to cover the snapshotting time.

          2. We have several features (read-only observer, replacing the DataTree with a key-value store) that we are planning to push upstream. These features rely on the assumption that any request that passed the commit processor or reached the DataTree is considered committed. I will create JIRAs describing these features.

          With this approach, I think the more complicated part is on the leader side, since it needs to hold uncommitted txns in memory and only apply them when it gets NEWLEADER_ACK from the followers.

          fournc Camille Fournier added a comment -

          I wanted to try and look at this but the test case provided doesn't replicate for me in the 3.4 branch. Is this something that will fail regularly or intermittently? I've run it several times now with no luck in replicating.

          fournc Camille Fournier added a comment -

          Ok I agree with all the great analysis you guys have done and think the proposed solution makes sense. The devil will be in writing a reproducible test case for this madness. Is anyone working on doing that?

          fpj Flavio Junqueira added a comment -

          Two points that are critical here for the correctness of the protocol are that the state is persisted in the right order and that the follower changes its current epoch only when it has accepted the state the leader is proposing. It is fine (and necessary even) to persist the transactions that the follower accepts, but as we have discussed already, they can't be part of a snapshot if they haven't been committed. Note that it is ok for a follower to accept these transactions even if it loses its connection to the prospective leader.

          Thawan Kooburat

          In your proposal, the leader may tear down the quorum if syncLimit is not long enough to cover the snapshotting time.

          I'm not exactly sure of what part of the proposal you're referring to here, but if the follower does not receive a complete snapshot it should throw it away. In the case the snapshot is received fully but the subsequent diff isn't, it is still fine to persist the snapshot as long as the snapshot doesn't contain uncommitted state.

          Camille Fournier

          I wanted to try and look at this but the test case provided doesn't replicate for me in the 3.4 branch.

          Interesting, it does reproduce for me on trunk, reliably.

          The devil will be in writing a reproducible test case for this madness.

          There are possibly multiple ways of getting a dirty snapshot. I don't think a single test will cover all possibilities.

          Is anyone working on doing that?

          I have written no code so far, aside from doing a few tests with removing calls to take a snapshot here and there. I don't know if Thawan Kooburat has written any code.

          Thawan Kooburat are you interested in working on/providing a patch? Otherwise, I can work on it.

          fournc Camille Fournier added a comment -

          Yeah just tried it against trunk as well and it doesn't repro there either for me. I think some sort of testing in the Zab1_0Test model would be good. The more reproducible the better.

          fpj Flavio Junqueira added a comment -

          I have actually produced a patch to solve the issue initially spotted in this jira and uploaded it to ZOOKEEPER-1558. That patch only partially solves the issue after all the discussion and analysis. In that patch, I have also moved the test to Zab1_0Test.

          The patch was stale, though, so I have just uploaded a new one that applies to trunk. I'm sorry for stating the obvious, but only apply the test changes to make sure that the test fails. It should fail reliably.

          thawan Thawan Kooburat added a comment -

          Regarding the syncLimit, I was referring to this logic in the leader, which is not part of the Zab protocol but is used to ensure that failures are detected in a timely fashion. We need to make sure that the follower starts communicating within syncLimit after it gets the UPTODATE packet.

                        if (!tickSkip && !self.getQuorumVerifier().containsQuorum(syncedSet)) {
                          //if (!tickSkip && syncedCount < self.quorumPeers.size() / 2) {
                              // Lost quorum, shutdown
                            // TODO: message is wrong unless majority quorums used
                              shutdown("Only " + syncedSet.size() + " followers, need "
                                      + (self.getVotingView().size() / 2));
                              // make sure the order is the same!
                              // the leader goes to looking
                              return;
                        }
          

          Not sure if it is reasonable to treat this as a blocker issue given the current release timeline for 3.5.0 and the amount of changes needed to fix this one. I am happy to help on the leader part.

          fpj Flavio Junqueira added a comment -

          Thawan Kooburat Ok, I got a better clue of what you're referring to with the syncLimit comment, but I'm not there yet. syncLimit has always been a parameter that limits the amount of time a follower can take to catch up, so I'm not proposing any change to the semantics of syncLimit, just to make it clear.

          About having it for 3.5.0, I suggest we make it a blocker for 3.5.0. If necessary, I also suggest we delay the release to have it in, although certainly not ideal.

          Given that we will be creating a new branch (3.5), I suppose that we don't need to have some of the backward-compatibility stuff that we currently have in the code to make sure that 3.3 servers talk to 3.4 servers, yes? Perhaps this is a question for Patrick Hunt.

          It would be awesome if you could work on the leader part. I think the only tricky part on the follower side is making sure that everything is persisted at the right time, but I don't think that we will need major code changes. If you can work on both, I'm happy to be the reviewer of your code, and otherwise I can work on the follower part and we review each other's patches.

          fpj Flavio Junqueira added a comment -

          This is a preliminary patch with the changes I'd like to make on the learner side. It simplifies the code in syncWithLeader a lot, since I assume we are able to remove the whole backward compatibility machinery when moving to 3.5.0.

          I have also left code commented out just to illustrate what I have in mind. If it is correct, I can remove it for the final patch.

          It would be great to get some comments on the direction, but keep in mind that this is still preliminary, so I don't recommend anyone to try to run it.

          phunt Patrick Hunt added a comment -

          Any update/comments on this? We should really cut a 3.4.6 asap that includes a fix for this issue.

          thawan Thawan Kooburat added a comment -

          I will have time to work on it this week. I will try to post an initial version soon.

          mahadev Mahadev konar added a comment -

          Thanks Thawan Kooburat!

          fpj Flavio Junqueira added a comment -

          In the spirit of coordination, I have posted a preliminary patch with some changes I believe we should be doing on the learner side. I was waiting for comments and after the last batch of messages, I'm wondering if there is anything you need me to do or if I should just leave it up to you. I'm fine either way, but what I heard a few posts above was that Thawan wanted to work on the leader side.

          thawan Thawan Kooburat added a comment -

          Flavio: The patch on the learner side looks good at the moment. I will have more updates as I work on the leader side.

          Here is the current implementation plan, along with some thoughts:

          1. The server now initializes its DB by loading from the snapshot only. In order to get lastLoggedZxid(), I walk through the entire txnlog, not just the last file, so we don't reintroduce the bug found in ZOOKEEPER-596 (see the sketch after this list).

          2. When the leader starts, it will have an empty committedLog (since the txnlog is not replayed). However, it will walk through the txnlog directly using the same logic it uses to walk the committedLog.

          3. When the leader has sufficient votes for NEWLEADER, it applies all the txns to the DataTree and also adds them to the committedLog. So any follower that tries to synchronize after the quorum is formed can use the committedLog for synchronization.

          4. On the learner side, the learner can apply the transactions from its txnlog and what it gets from the leader after it receives UPTODATE. Potentially, this can be up to 100K txns with the default configuration. I am worried about the time it takes to perform this, since the leader has already switched to syncLimit at this point and may tear down the quorum. We can add an extra stage for this phase by using the unused UPTODATE ACK, but it is not part of the protocol.

          5. There is logic that kills dead sessions in the current ZooKeeperServer.loadData(). I am not entirely sure why we need that clean-up logic.

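          A minimal sketch of the full-scan approach mentioned in point 1 above, scanning every log file rather than only the newest one; the stub types below are hypothetical, not the FileTxnLog API:

          import java.util.List;

          class LastLoggedZxidSketch {
              // Hypothetical accessor: all txn zxids from one log file, in order.
              interface TxnLogFileStub {
                  List<Long> readAllZxids();
              }

              // Scan every log file, not just the one with the highest starting zxid,
              // so a newer but empty (or partially written) file cannot hide txns
              // that live in an older file (the ZOOKEEPER-596 pattern).
              static long lastLoggedZxid(List<TxnLogFileStub> allLogFiles, long snapshotZxid) {
                  long highest = snapshotZxid; // fall back to the snapshot zxid if the logs are empty
                  for (TxnLogFileStub file : allLogFiles) {
                      for (long zxid : file.readAllZxids()) {
                          highest = Math.max(highest, zxid);
                      }
                  }
                  return highest;
              }
          }
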
          fpj Flavio Junqueira added a comment -

          One question about this jira: is there any reason why it is marked for 3.4.6 but not for 3.5.0? It should be marked for 3.5.0 too, right?

          thawan Thawan Kooburat added a comment -

          This is a combined patch (it includes Flavio's and Jacky's patches).

          thawan Thawan Kooburat added a comment -

          High-level Description.

          Leader:
          1. The leader loads the db from the snapshot only. If the quorum hasn't formed yet, it will synchronize with the learner using proposals read directly from the txnlog.

          2. When the quorum is formed (sufficient NEWLEADER_ACKs received), the leader replays its txns and adds them to the committedLog. Any learner joining after this point will only use the committedLog for synchronization.

          Learner:
1. Similar to the leader, the learner loads the db from the snapshot only. Since it sends lastLoggedZxid to the leader as part of FOLLOWERINFO/OBSERVERINFO, it should only get txns that it hasn't seen if it gets a DIFF. Otherwise, it will get a SNAP + DIFF.

2. During synchronization (after receiving a DIFF or SNAP packet), the learner will get a stream of txns in the form of proposal + commit. The learner logs each txn to the txnlog only when it receives the corresponding commit, because the learner may receive outstanding proposals before getting UPTODATE if it joins the quorum after it is formed (see the sketch after this list).

3. When the learner receives NEWLEADER, it flushes the txnlog to disk before sending NEWLEADER_ACK back to the leader.

4. When the learner receives UPTODATE, it flushes the txnlog again (in case it has seen any more committed txns). Then it replays the txnlog it had originally seen before joining the quorum and applies the committed txns it received from the leader (currently held in memory) to the db. It then sends UPTODATE_ACK back to the leader, but the leader is going to ignore this ACK since it is part of the old protocol.

5. After sending UPTODATE_ACK, the learner starts up the server. The follower will have to process any outstanding proposals that it received during the synchronization phase, so that the leader gets ACKs from the follower (and the follower doesn't get confused if it sees the corresponding commit messages from the leader when it starts processing requests).
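
To make items 2-4 concrete, here is a rough sketch of the learner-side handling (PROPOSAL, COMMIT, NEWLEADER and UPTODATE are the protocol messages discussed above, but the classes and method names are invented placeholders, not the actual Learner code):
```
import java.util.ArrayDeque;
import java.util.Deque;

class LearnerSyncSketch {
    enum Type { PROPOSAL, COMMIT, NEWLEADER, UPTODATE }
    record Packet(Type type, long zxid) {}

    // Stand-in for the on-disk txnlog plus the in-memory tree.
    interface TxnStoreLike {
        void append(long zxid);            // write a committed txn to the txnlog
        void flush();                      // fsync the txnlog
        void replayAndApplyToDataTree();   // build the in-memory state
    }

    private final Deque<Long> pendingProposals = new ArrayDeque<>();

    void onPacket(Packet p, TxnStoreLike store) {
        switch (p.type()) {
            case PROPOSAL -> pendingProposals.addLast(p.zxid()); // buffer only, not yet committed
            case COMMIT -> {
                Long zxid = pendingProposals.pollFirst();
                if (zxid != null) {
                    store.append(zxid);    // logged only once the commit arrives
                }
            }
            case NEWLEADER -> store.flush(); // then send NEWLEADER_ACK to the leader
            case UPTODATE -> {
                store.flush();               // pick up any commits seen since NEWLEADER
                store.replayAndApplyToDataTree();
                // then send UPTODATE_ACK and start the server
            }
        }
    }
}
```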

          Db initialization:
1. When loading the snapshot, it also captures the snapshot zxid into a separate variable. This is later used during txn replay. I didn't want to overload lastProcessedZxid since it may have other implications.

2. The last logged zxid is obtained in a way that guarantees we don't run into ZOOKEEPER-596.

3. I saw the logic that kills dead sessions on db loading. I am not sure why we have this logic and don't know its implications for data inconsistency.

4. No snapshot is taken (unless the learner gets a snapshot) when the new quorum forms. I think we might want to add some logic to trigger a background snapshot soon after the quorum is formed.

          Side Note:

• One of the major complexities in the existing code on the learner side is that the learner may see outstanding proposals before it receives UPTODATE. I want to cover this case so that we can do a rolling upgrade when pushing this patch/release. The system should behave correctly if the rolling upgrade is done by upgrading all the followers first. Zab1_0Test is modified to test this case.
• This patch also fixed ZOOKEEPER-1551.
• The actual change on our internal branch is about half the size of the submitted patch. The other half is a result of our refactoring work on the learner synchronization logic as part of ZOOKEEPER-1413. Since we currently use this in production and it has a better unit-testing facility, I used this synchronization logic instead of the current one in trunk.
• Is there a way to trigger a unit test run and submit a review request against the 3.4 branch?
          fpj Flavio Junqueira added a comment -

          Nice work, Thawan! I haven't reviewed the patch yet, but I have two quick points about it:

          1. Jacky's test case was written on QuorumPeerMainTest and I produced one for Zab1_0Test, which I posted on ZOOKEEPER-1558. We don't have to use my test case, but I'd rather have whatever test case we use in Zab1_0.
          2. I was under the impression that we were working towards a 3.5.0 because that way we can remove at least some of the code we have currently in place for backward compatibility. Do you see it differently? It sounds like you're shooting for 3.4 at least.
          thawan Thawan Kooburat added a comment -

Ah sorry, I forgot to include your test case from ZOOKEEPER-1558. I can add that one too. Jacky's test is more like an end-to-end test. I think we should have one for this patch, but it would be better if the test doesn't rely too much on ZooKeeper internals, since that will make the code hard to maintain moving forward.

The reason I created the first version on 3.4 is that ZOOKEEPER-107 is still pending on trunk, which may require major rework here if that patch gets committed.

I am also aiming to support rolling upgrade for both 3.4.5 -> 3.4.6 and 3.4.x -> 3.5. I am not sure about 3.3.x -> 3.4.6 since some backward compatibility with respect to Zab Pre 1.0 is removed. However, the backward compatibility that needs to be retained is allowing the learner to handle outstanding proposals during the synchronization phase. This is not related to the protocol; it is just the implementation of the leader. I haven't changed the leader to get rid of this behavior in this patch.

          fpj Flavio Junqueira added a comment -

          My point about the test case is that I feel that such a test should be in Zab1_0Test because it touches that part of the code. I don't find it strictly necessary to have that test case, but it would be great if we had a test case, end-to-end or not, in the right place.

          I agree with the point about ZOOKEEPER-107.

          I am not sure about 3.3.x -> 3.4.6 since some backward compatibility with respect to Zab Pre 1.0 is removed

          That's one reason I think we should be aiming for 3.5.0. Pat is not going to be happy about not having 3.3.x -> 3.4.6.

          thawan Thawan Kooburat added a comment -

I guess that this JIRA is considered a blocker for 3.4.6 because it is a data inconsistency bug. In order to fix it, we decided that a modification to the protocol is needed to support SNAP+DIFF, which should be considered a significant change. To simplify the implementation, we removed the compatibility with Zab Pre 1.0.

I think we have to decide whether we want to retain backward compatibility or fix the data inconsistency for the 3.4.6 release. But since there is a rolling upgrade path (3.3.x -> 3.4.5 -> 3.4.6), I don't think backward compatibility is a major problem.

          fpj Flavio Junqueira added a comment -

          Agreed, it is marked as a blocker for 3.4.6. Because I'd rather not have to deal with backward compatibility 3.3->3.4, my preference is that we focus on a full-fledged patch for 3.5.0. For 3.4.6, we could commit the patch we have for ZOOKEEPER-1558. It is a partial fix that targets the problem that was originally reported in this jira and doesn't have backward compatibility problems to my knowledge. Is it acceptable?

          thawan Thawan Kooburat added a comment -

          Producing a patch for just one branch is definitely less work. So I am fine with the plan.

          fanster.z Jacky007 added a comment -

Thawan Kooburat It looks great. A question not related to this patch: why does zk choose to clear the database and load the snapshot again, rather than keep it in memory and reuse it when the learner or leader fails?

          thawan Thawan Kooburat added a comment -

Jacky007 I think it is a side effect of the existing mechanism that recreates Follower/Leader/ObserverZooKeeperServer whenever there is a leader election. I believe that if uncommitted txns never reach the DataTree, there is no need to reload the database.

In our Read-only Observer (ZOOKEEPER-1607), we could even retain the server and its processing pipeline across leader elections.

          fpj Flavio Junqueira added a comment -

          Since no one objected to my plan, I'm marking this issue for 3.5.0 only. I've also marked ZOOKEEPER-1558 for 3.4.6.

          fpj Flavio Junqueira added a comment -

          Thawan Kooburat One question about this patch. It seems that it touches the part of the code that we fix in ZOOKEEPER-1642. Should we incorporate ZOOKEEPER-1642 into this patch? We should still load the database conditionally as in ZOOKEEPER-1642, right?

          thawan Thawan Kooburat added a comment -

          Conditional db loading is already done but at a different location. You can see ZKDatabase.loadDataBase() in the patch.

          fpj Flavio Junqueira added a comment -

          Sounds right, but one problem I'm seeing is that in loadData() the patch is clearing the database and reloading it, which will cause the leader to load the database twice during recovery, which is the problem described in ZOOKEEPER-1642.

          thawan Thawan Kooburat added a comment -

Ah yes, I forgot about that. In our branch we do that since we saw duplicate txns in the commitLog. With your diff, that should not happen, so I can remove it.

Since we no longer need to push this to 3.4.6, I am thinking of splitting this patch a bit. About half of the patch is actually a modified version of ZOOKEEPER-1413. If you agree, I can upload a patch for that jira when ZK-107 is committed.

          fpj Flavio Junqueira added a comment -

          With your diff, that should not happen, so I can remove it.

          Just so that I make sure we are on the same page:

          1. when you say my diff you're referring to the patch in ZOOKEEPER-1642
          2. when you say that it is ok to remove it, you're referring to clearing the database in loadData.

          In your latest patch, loadDataBase already loads conditionally, so if we simply don't clear the database in loadData, then we should avoid having the leader loading it twice. Do you agree? If so, then I don't think we need the patch in ZOOKEEPER-1642 at all, your new version of loadDataBase should take care of it.

Since we no longer need to push this to 3.4.6, I am thinking of splitting this patch a bit. About half of the patch is actually a modified version of ZOOKEEPER-1413. If you agree, I can upload a patch for that jira when ZK-107 is committed.

          Sounds like a plan.

          fanster.z Jacky007 added a comment -

Thawan Kooburat It sounds like the right time to fix this.

With this patch, it should be safe to do ZOOKEEPER-1674.

          fanster.z Jacky007 added a comment -

          By the way, I think we should fix ZOOKEEPER-1667

          fpj Flavio Junqueira added a comment -

It has been too long; can we try to wrap up this jira?

          fpj Flavio Junqueira added a comment -

          We need to discuss the order of patches to get in. I have also asked about the patch I have uploaded to ZOOKEEPER-1642 and how it relates to this one.

          fpj Flavio Junqueira added a comment -

          After having a look at both the patch here and the one in ZOOKEEPER-1642, it sounds like this patch here is not solving that problem. In that jira, I'm trying to avoid loading the database a second time, while the patch here clears the database in loadData before reloading it, so it still loads a second time.

Thawan Kooburat, you say that with that patch you can remove the part that clears and reloads the database. It seems right for the case I'm focusing on in ZOOKEEPER-1642, but I wonder if this is right in general. I was also wondering if we could remove the code in loadDataBase that replays the txn log conditionally. We should be able to do it, and to make sure that we don't have duplicates in the commitLog, perhaps we could simply check the highest zxid currently present there, no? It might simplify the code if we don't have those boolean flags around.
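
For illustration only, such a duplicate check could look roughly like this (a standalone sketch with made-up names, not the actual ZKDatabase code):
```
import java.util.ArrayDeque;
import java.util.Deque;

class CommittedLogSketch {
    record Proposal(long zxid) {}

    private final Deque<Proposal> committedLog = new ArrayDeque<>();
    private long maxCommittedLog = -1;

    // Ignore anything at or below the highest zxid already present, so a
    // second replay of the txn log cannot introduce duplicates and no
    // "already replayed" boolean flags are needed.
    void addCommittedProposal(Proposal proposal) {
        if (proposal.zxid() <= maxCommittedLog) {
            return;
        }
        committedLog.addLast(proposal);
        maxCommittedLog = proposal.zxid();
    }
}
```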

          thawan Thawan Kooburat added a comment -

          Database loading:
I think we will eventually need to be able to reuse the database instance across leader election rounds. This is mainly for improving availability, since we will have much lower downtime.

Replay txnlog conditionally:
I think I was trying to mimic the existing code that initializes the database conditionally. Ideally, it should have been more of a sanity check, because we should only try to replay the txnlog once.

          We should commit ZOOKEEPER-1642 before I revisit this patch again.

          liping Liping added a comment -

          Hi

ZK-1653 is a duplicate of this one, which is now targeted for 3.5.0. Obviously it will take a long time to deliver 3.5.0. So could you please help re-open ZK-1653 and provide a simple patch for 3.4.6 only? This issue has happened several times in our environment.

          Really appreciate your help.

          Best Regards
          Liping

          fpj Flavio Junqueira added a comment -

          Hi Liping, We have addressed this issue for 3.4.6 in ZOOKEEPER-1558.

          liping Liping added a comment -

          Hi Flavio

          Thanks for your quick response.

But I do not understand. ZK-1558 only seems to avoid adding uncommitted txns from the txnlog to the snapshot file during leader startup.
For ZK-1653, if the learner crashed between takeSnapshot() and setCurrentEpoch() after receiving a NEWLEADER or UPTODATE qp, the learner would still seem to get the 'java.io.IOException: The current epoch, x, is older than the last zxid xxxxx' exception on restart. Right?

          Could you please help to elaborate what I missed?

          Thanks a lot!
          Liping

          fpj Flavio Junqueira added a comment -

          Here is my train of thought. I marked ZOOKEEPER-1653 as a duplicate because in my understanding the changes proposed here solve the problem described there. ZOOKEEPER-1558 solves the problem originally reported in this issue, but it does not solve all corner cases. We have left it for 3.5.0 for the reasons Thawan Kooburat raised here previously.

          Since we are discussing in ZOOKEEPER-1549, I'm referring to the issue reported and discussed here. If you feel that ZOOKEEPER-1653 should be addressed and should be a blocker for 3.4.6, then we should discuss it there. But, please keep in mind that if we make it a blocker, it will delay the release even further.

          liping Liping added a comment -

          Thanks for the clarification, Flavio
          Yeah, I would follow ZK-1653 for the related discussion if necessary. Michi just re-opened it for a 3.4.6 patch. Thanks for all of your help!

          fpj Flavio Junqueira added a comment -

          hey Thawan Kooburat,

          We should commit ZOOKEEPER-1642 before I revisit this patch again.

          Done! Despite the delay, can we try to wrap this up?

          thawan Thawan Kooburat added a comment -

Here is a recap of the issue, for those who just found this JIRA.

Problem:
When the leader starts, it treats every txn in its txnlog as committed and applies all of them to its in-memory data tree, even though some of them may have been acked only by the leader (or a minority).

If there is a follower that needs to synchronize with the leader via snapshot, the follower will get a snapshot with uncommitted txns in it and write this dirty snapshot to disk. If there is a leader failure, it is possible that the uncommitted txns are discarded in the next leader election round, so this follower will have a dirty snapshot on disk with no way to recover from it.

The solution so far:
The fix on the follower side is simply to not take a snapshot until the quorum switches to the broadcast phase. The follower can have a dirty snapshot in memory, but as long as it doesn't write it to disk, we are ok and that part of the issue is addressed.

On the leader side, the proposed patch is to change the server startup and synchronization sequence. Uncommitted txns (any txn after the last snapshot) should never be applied to the data tree until the synchronization phase is done. We use the synchronization phase to catch up all followers, which implies that all of the followers accepted the txns. Then, we apply these txns before starting the broadcast phase.
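
A compact way to read the two rules above (purely illustrative, with made-up names; this is not the patch itself):
```
class RecoveryRulesSketch {
    enum Phase { SYNCHRONIZATION, BROADCAST }

    private volatile Phase phase = Phase.SYNCHRONIZATION;

    // Follower side: a dirty in-memory tree is tolerable during recovery,
    // but nothing is written to disk until the quorum is broadcasting.
    void maybeSnapshot(Runnable takeSnapshot) {
        if (phase == Phase.BROADCAST) {
            takeSnapshot.run();
        }
    }

    // Leader side: txns logged after the last snapshot are applied to the
    // data tree only once the synchronization phase has completed.
    void finishSynchronization(Runnable applyPendingTxns) {
        applyPendingTxns.run();
        phase = Phase.BROADCAST;
    }
}
```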

          I will try to find someone on my team to help on this.

          hdeng Hongchao Deng added a comment -

          Hi Thawan Kooburat,

          > The follower can have dirty snapshot in memory but as long as it doesn’t write to disk

          Do you mean the following case:
          1. A was the leader for B, C
          2. A partitioned but got a new request. A wrote the request to the log.
3. A crashed and restarted. But A restores everything in the log, including uncommitted txns, i.e. the new request in this case.
          4. A joined the quorum that B and C formed (e.g. C is the leader)
          5. Since C is the leader, C truncates A's log.
6. But A has a dirty in-memory tree (it still contains the uncommitted request from step 2).

          Correct?

          hdeng Hongchao Deng added a comment -

I think the root of the problem is in this function:
```
public long restore(DataTree dt, Map<Long, Integer> sessions,
        PlayBackListener listener) throws IOException {
    ...
    while (true) {
        // iterator points to
        // the first valid txn when initialized
        hdr = itr.getHeader();
        ...
        processTransaction(hdr, dt, sessions, itr.getTxn());
        ...
        listener.onTxnLoaded(hdr, itr.getTxn());
        if (!itr.next()) break;
    }
    ...
```
Instead of processing all transactions in the log, keeping an index of committedUpTo and processing only up to that point makes more sense. Am I correct?
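
As a rough illustration of that idea (a standalone sketch; the types and names are invented and this is not the real FileTxnSnapLog API), the replay would stop at the last committed zxid instead of draining the whole log:
```
import java.util.List;

class RestoreSketch {
    record LoggedTxn(long zxid) {}

    interface DataTreeLike { void apply(LoggedTxn txn); }

    // Replay the txnlog only up to committedUpTo; anything beyond that zxid
    // was never committed by a quorum and must not reach the in-memory tree.
    long restoreUpTo(List<LoggedTxn> txnLog, DataTreeLike tree, long committedUpTo) {
        long highestApplied = -1;
        for (LoggedTxn txn : txnLog) {
            if (txn.zxid() > committedUpTo) {
                break;
            }
            tree.apply(txn);
            highestApplied = txn.zxid();
        }
        return highestApplied;
    }
}
```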

          hdeng Hongchao Deng added a comment -

I don't know how to use the code formatting here, so I'll paste the link instead: https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java#L158

          hdeng Hongchao Deng added a comment -

Flavio Junqueira, Thawan Kooburat, Jacky007

          The fix on the follower side is to simply not taking snapshot until the quorum switch to broadcast phase. The follower can have dirty snapshot in memory but as long as it doesn’t write to disk, we are ok and part of the issue is addressed.

          On the leader side, the proposed patch is to change server startup and synchronization sequence. Uncommitted txn (any txn after the last snapshot) should never get applied to the data tree until synchronization phase is done. We use synchronization phase to catchup all follower and imply that all of the follower accepted the txn. Then, we apply these txns before starting broadcast phase.

Is this issue fixed yet? Or does anyone have an internal patch for this?
This issue seems like a fairly major thing for stability.

          fpj Flavio Junqueira added a comment -

          I've done some further analysis of this issue after some time (it took me a while to recover all uncached state), and here is what I think needs to happen.

It is not incorrect that we take snapshots where we currently take them. Snapshots are simply compacted versions of the txn log history. The real problem is that we are apparently truncating the txn log appropriately, but we are loading a snapshot that does not reflect the state of the txn log. What needs to happen is that once we truncate, we make sure that the snapshot we load reflects the right txn log state, which we aren't doing and which is causing the problem described here. The solution is to go back to an earlier snapshot and rebuild the state from that snapshot. Jacky007 actually proposed early on to delete the snapshot, and that is sounding like a viable way to sort this out.

          I'll let people chime in and otherwise and I'll mature the idea a bit longer and will start working on a patch.
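
For illustration, snapshot selection after a TRUNC could look roughly like this (a standalone sketch with invented types and names, not a concrete proposal for the patch):
```
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class TruncateRecoverySketch {
    record Snapshot(long lastZxid) {}

    // After the txnlog has been truncated to truncZxid, only a snapshot taken
    // at or before that point still reflects the surviving history; any newer
    // snapshot is skipped (or deleted, as Jacky007 suggested) and the state is
    // rebuilt from the older snapshot plus the truncated log.
    Optional<Snapshot> snapshotToLoadAfterTruncate(List<Snapshot> snapshots, long truncZxid) {
        return snapshots.stream()
                .filter(s -> s.lastZxid() <= truncZxid)
                .max(Comparator.comparingLong(Snapshot::lastZxid));
    }
}
```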

          EdRowe Ed Rowe added a comment -

          Flavio Junqueira I think the approach of deleting snapshots is a good one. Further, I think you hit the nail on the head when you say "Snapshots are simply compacted versions of the txn log history." The previous idea (further up the comment history) in which the log, but not snapshots, would be allowed to contain uncommitted transactions seems fragile. A pedantic update to your definition would be "Snapshots are simply compacted versions of the txn log history, as applied to the DataTree." to recognize that snapshots only ever contain information that has been in a DataTree (though not necessarily a DataTree that was ever visible outside a given Node) while logs can contain information that has never been in a DataTree.

          One issue to account for in the fix is the case where there is no earlier snapshot to rebuild from. This could occur if an operator has deleted older snapshots. I think we'd still want to delete the snapshot being truncated and arrange for the learner node to start over with a blank database.

          Another consideration is that, as I've written in ZOOKEEPER-2436, the learner might receive from the leader a SNAP rather than a TRUNC. In this situation, the snapshot on the learner node that a TRUNC would have deleted will still be present on the learner node, but it will no longer be the newest snapshot. I don't think this will cause any problems but I did want to bring it up.

          Finally, there were some concerns raised early in the comment thread that deleting snapshots might be too aggressive. If this is still an issue, we could (perhaps optionally) rename logs and snapshots that we are truncating rather than delete them. The snapshot/log purge code might then clean these up if they are old enough. I'm not convinced that this is worth implementing now.

          fpj Flavio Junqueira added a comment -

          "Snapshots are simply compacted versions of the txn log history, as applied to the DataTree."

This is partially right, and I acknowledge that my statement isn't that precise either. The root of the problem is that sometimes we treat snapshots as having only committed data (during the broadcast phase) and other times we produce snapshots that have uncommitted data (during recovery). Whatever fix we have needs to be such that either snapshots contain committed data only or we enable snapshots to be deleted. I don't personally like the idea of deleting snapshots because if we don't get it right, then we will be making the zk state inconsistent by losing quorum on some transactions. In fact, one scenario to be aware of is the one in which an earlier snapshot has been committed while a more recent one hasn't. If we wipe out the former snapshot, then we are in trouble.

          The one important thing to keep in mind is that we can't truncate a snapshot, so either we have only committed state in snapshots so that we never fall into this situation of having to truncate a snapshot or we start deleting snapshots as a brute-force way of truncating, but in this latter case, we need to be really careful.

          One issue to account for in the fix is the case where there is no earlier snapshot to rebuild from.

          I'm not concerned about this case because we want the leader to transfer data to the learner. I'm clearly assuming that if any earlier snapshot has been committed by the same learner, then we aren't going to discard that snapshot.

          the snapshot on the learner node that a TRUNC would have deleted will still be present on the learner node, but it will no longer be the newest snapshot.

If the learner has a newer valid snapshot, then it shouldn't be a problem. However, if the learner has to load, then it will have to treat it as if it were the latest (which it probably is, the latest valid one).

          there were some concerns raised early in the comment thread that deleting snapshots might be too aggressive

One important issue to keep in mind is that a quorum might commit a snapshot, and I'm concerned that if we start deleting snapshots, we will end up with bugs where we delete them incorrectly.

          My ideal approach is:

          1. Snapshots only contain committed data
          2. In the case a leader needs to transfer a snapshot, then transfer the latest snapshot containing the committed data and a suffix from the log.

          However, it is sounding a bit difficult to do it in a backwards compatible manner.
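
In outline, the ideal approach could look like this (an illustrative sketch with invented names, setting aside the backward-compatibility question):
```
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class SnapPlusDiffSketch {
    record Snapshot(long lastZxid) {}
    record LoggedTxn(long zxid) {}
    record SyncPayload(Optional<Snapshot> snapshot, List<LoggedTxn> logSuffix) {}

    // Point 1: the snapshots passed in contain committed data only.
    // Point 2: ship the newest such snapshot plus every committed txn after it.
    SyncPayload buildSnapPlusDiff(List<Snapshot> committedSnapshots,
                                  List<LoggedTxn> committedLog) {
        Optional<Snapshot> snap = committedSnapshots.stream()
                .max(Comparator.comparingLong(Snapshot::lastZxid));
        long fromZxid = snap.map(Snapshot::lastZxid).orElse(-1L);
        List<LoggedTxn> suffix = committedLog.stream()
                .filter(txn -> txn.zxid() > fromZxid)
                .toList();
        return new SyncPayload(snap, suffix);
    }
}
```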

          phunt Patrick Hunt added a comment -

Shouldn't the fix version also include 3.4? The affects version currently lists 3.4.3.


            People

            • Assignee: fpj Flavio Junqueira
            • Reporter: fanster.z Jacky007
            • Votes: 1
            • Watchers: 18
