ZooKeeper / ZOOKEEPER-2845

Data inconsistency issue due to retain database in leader election

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.10, 3.5.3, 3.6.0
    • Fix Version/s: None
    • Component/s: quorum
    • Labels:
      None

      Description

      In ZOOKEEPER-2678, the ZKDatabase is retained across leader elections to reduce the unavailable time. In a ZooKeeper ensemble, it is possible for the snapshot to be ahead of the txn log (due to a slow disk on the server, etc.), or for the txn log to be ahead of the snapshot because no commit message has been received yet.

      If the snapshot is ahead of the txn log, the SyncRequestProcessor queue is drained during shutdown, so the snapshot and txn log become consistent again before leader election happens; this is not an issue.

      But if the txn log is ahead of the snapshot, the ensemble can end up with inconsistent data. Here is a simplified scenario that shows the issue:

      Let's say we have 3 servers in the ensemble: A and B are followers, C is the leader, and the snapshots and txn logs on all of them are up to T0:
      1. A new request to create node N reaches leader C and is converted to txn T1.
      2. Txn T1 is synced to disk on C, but A and B restart just before the proposal reaches them, so T1 does not exist on A or B.
      3. A and B form a new quorum after the restart; let's say B becomes the leader.
      4. C changes to the LOOKING state because it no longer has enough followers, then syncs with leader B reporting last zxid T0, which results in an empty DIFF sync.
      5. C restarts before taking a snapshot and replays the txns on disk, which include T1; now C has node N, but A and B do not (see the sketch below).
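      To make the failure mode in step 5 concrete, the following is a minimal, hedged sketch (not actual ZooKeeper code) of the two "tails" that diverge on C. The names lastZxidInTxnLog and lastZxidInRetainedDb are hypothetical stand-ins for the newest zxid fsynced to the txn log and the lastProcessedZxid of the retained in-memory database.

      // Hedged illustration only; zxid values are made up (T0 = 0x100000001, T1 = 0x100000002).
      public class RetainedDbDivergenceSketch {
          public static void main(String[] args) {
              long lastZxidInTxnLog = 0x100000002L;     // T1: C fsynced the proposal before losing the quorum
              long lastZxidInRetainedDb = 0x100000001L; // T0: the retained database never applied T1

              // During sync C reports T0, so the new leader sends an empty DIFF; but a later
              // restart replays the txn log up to T1, resurrecting a txn the new quorum never saw.
              if (lastZxidInTxnLog > lastZxidInRetainedDb) {
                  System.out.println("txn log tail 0x" + Long.toHexString(lastZxidInTxnLog)
                          + " is ahead of retained database tail 0x"
                          + Long.toHexString(lastZxidInRetainedDb));
              }
          }
      }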

      I have also included a test case that reproduces this issue consistently.

      We have a substantially different RetainDB version that avoids this issue by reconciling the snapshot and txn files before leader election; we will submit it for review.
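      The RetainDB patch itself is not attached here, so the following is only a rough, hedged sketch of what "reconciling the snapshot and txn files before leader election" could look like; it is not the authors' implementation. It assumes ZKDatabase's getDataTreeLastProcessedZxid(), clear(), and loadDataBase() methods, and leaves the on-disk txn-log tail as a caller-supplied parameter rather than assuming a specific accessor.

      import java.io.IOException;

      import org.apache.zookeeper.server.ZKDatabase;

      public final class RetainedDbReconcileSketch {
          private RetainedDbReconcileSketch() {}

          /**
           * Hedged sketch, not the actual RetainDB patch: before entering leader
           * election, make sure the retained in-memory database covers everything
           * that is durable in the txn log, so the peer reports the same last zxid
           * it would report after a full restart.
           *
           * @param zkDb                 the peer's retained ZKDatabase
           * @param lastLoggedZxidOnDisk newest zxid in the on-disk txn log, obtained
           *                             by the caller from the txn log files
           */
          public static long reconcileBeforeElection(ZKDatabase zkDb, long lastLoggedZxidOnDisk)
                  throws IOException {
              long retained = zkDb.getDataTreeLastProcessedZxid();
              if (lastLoggedZxidOnDisk > retained) {
                  // The txn log holds entries the retained tree never applied (the bad
                  // case described above): fall back to a full reload so lastProcessedZxid
                  // reflects everything on disk before the peer votes and syncs.
                  zkDb.clear();
                  return zkDb.loadDataBase(); // replays snapshot + txn log from disk
              }
              return retained; // retained database already covers the txn log; keep it
          }
      }

      In a sketch like this the full reload only happens in the rare divergent case, so the availability benefit of ZOOKEEPER-2678 would be kept on the common path.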

        Issue Links

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user lvfangmin opened a pull request:

          https://github.com/apache/zookeeper/pull/310

          ZOOKEEPER-2845 [Test] Test used to reproduce the data inconsistency issue due to retain database in leader election

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-2845-TEST

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/zookeeper/pull/310.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #310


          commit ff0bc49de51635da1d5bff0e4f260a61acc87db0
          Author: Fangmin Lyu <allenlyu@fb.com>
          Date: 2017-07-14T23:02:20Z

          reproduce the data inconsistency issue


          hadoopqa Hadoop QA added a comment -

          -1 overall. GitHub Pull Request Build

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/883//console

          This message is automatically generated.

          githubbot ASF GitHub Bot added a comment -

          Github user lvfangmin commented on a diff in the pull request:

          https://github.com/apache/zookeeper/pull/310#discussion_r127567852

          — Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java —
          @@ -784,4 +784,126 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                           maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
               }

          +    @Test
          +    public void testTxnAheadSnapInRetainDB() throws Exception {
          +        // 1. start up server and wait for leader election to finish
          +        ClientBase.setupTestEnv();
          +        final int SERVER_COUNT = 3;
          +        final int clientPorts[] = new int[SERVER_COUNT];
          +        StringBuilder sb = new StringBuilder();
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            clientPorts[i] = PortAssignment.unique();
          +            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique() + ":"
          +                    + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
          +        }
          +        String quorumCfgSection = sb.toString();
          +
          +        MainThread mt[] = new MainThread[SERVER_COUNT];
          +        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
          +            mt[i].start();
          +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
          +        }
          +
          +        waitForAll(zk, States.CONNECTED);
          +
          +        // we need to shutdown and start back up to make sure that the create session isn't the first transaction since
          +        // that is rather innocuous.
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            mt[i].shutdown();
          +        }
          +
          +        waitForAll(zk, States.CONNECTING);
          +
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            mt[i].start();
          +            // Recreate a client session since the previous session was not persisted.
          +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
          +        }
          +
          +        waitForAll(zk, States.CONNECTED);
          +
          +        // 2. kill all followers
          +        int leader = -1;
          +        Map<Long, Proposal> outstanding = null;
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (mt[i].main.quorumPeer.leader != null) {
          +                leader = i;
          +                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
          +                // increase the tick time to delay the leader going to looking
          +                mt[leader].main.quorumPeer.tickTime = 10000;
          +            }
          +        }
          +
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (i != leader) {
          +                mt[i].shutdown();
          +            }
          +        }
          +
          +        // 3. start up the followers to form a new quorum
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (i != leader) {
          +                mt[i].start();
          +            }
          +        }
          +
          +        // 4. wait one of the follower to be the leader
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (i != leader) {
          +                // Recreate a client session since the previous session was not persisted.
          +                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
          +                waitForOne(zk[i], States.CONNECTED);
          +            }
          +        }
          +
          +        // 5. send a create request to leader and make sure it's synced to disk,
          +        // which means it acked from itself
          +        try {
          +            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
          +                    CreateMode.PERSISTENT);
          +            Assert.fail("create /zk" + leader + " should have failed");
          +        } catch (KeeperException e) {}
          +
          +        // just make sure that we actually did get it in process at the
          +        // leader
          +        Assert.assertTrue(outstanding.size() == 1);
          +        Proposal p = (Proposal) outstanding.values().iterator().next();
          +        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
          +
          +        // make sure it has a chance to write it to disk
          +        Thread.sleep(1000);
          +        p.qvAcksetPairs.get(0).getAckset().contains(leader);
          +
          +        // 6. wait the leader to quit due to no enough followers
          +        waitForOne(zk[leader], States.CONNECTING);
          +
          +        int newLeader = -1;
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            if (mt[i].main.quorumPeer.leader != null) {
          +                newLeader = i;
          +            }
          +        }
          +        // make sure a different leader was elected
          +        Assert.assertTrue(newLeader != leader);
          +
          +        // 7. restart the previous leader
          +        mt[leader].shutdown();
          +        waitForOne(zk[leader], States.CONNECTING);
          +        mt[leader].start();
          +        waitForOne(zk[leader], States.CONNECTED);
          +
          +        // 8. check the node exist in previous leader but not others
          +        // make sure everything is consistent
          +        for (int i = 0; i < SERVER_COUNT; i++) {
          +            Assert.assertTrue("server " + i + " should not have /zk" + leader,
          +                    zk[i].exists("/zk" + leader, false) == null);
          — End diff –

          The test will fail here because the node exists on the previous leader.

          hanm Michael Han added a comment -

          Thanks for reporting this issue Fangmin Lv.

          C changes to the LOOKING state because it no longer has enough followers, then syncs with leader B reporting last zxid T0, which results in an empty DIFF sync.

          Are you saying leader B is sending a DIFF to follower C in this case? Since B does not have T1, I think it should send a TRUNC and C should drop T1 in its txn log.

          lvfangmin Fangmin Lv added a comment -

          Michael Han, T1 only exists in the txn log and hasn't been applied to the data tree yet, so the lastProcessedZxid on follower C is T0 and no TRUNC message is sent when it syncs with the leader.
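          For context, here is a simplified, hedged sketch of how a leader chooses the sync mode from the follower's reported last zxid; the real LearnerHandler logic handles more edge cases. Because C reports T0, which matches the new leader's last committed zxid, the outcome is an empty DIFF rather than a TRUNC.

          // Simplified, hedged sketch of the leader's sync-mode choice; not the actual
          // LearnerHandler code.
          public class SyncModeSketch {
              enum SyncMode { DIFF, TRUNC, SNAP }

              static SyncMode chooseSyncMode(long peerLastZxid, long minCommittedLog, long maxCommittedLog) {
                  if (peerLastZxid == maxCommittedLog) {
                      // Follower claims it is already caught up: empty DIFF.
                      // This is C's case when it reports T0 to leader B.
                      return SyncMode.DIFF;
                  } else if (peerLastZxid > maxCommittedLog) {
                      // Follower is ahead of the leader's committed log: tell it to truncate.
                      return SyncMode.TRUNC;
                  } else if (peerLastZxid >= minCommittedLog) {
                      // Send the committed proposals in (peerLastZxid, maxCommittedLog].
                      return SyncMode.DIFF;
                  } else {
                      // Too far behind the committed log: send a full snapshot.
                      return SyncMode.SNAP;
                  }
              }
          }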

          hanm Michael Han added a comment - edited

          Makes sense to me. I think we didn't have this issue previously because the zkDb was cleared across leader elections: C would find that its lastProcessedZxid is T1, rather than T0, while reinitializing the zkDb from the snap/txn log, which would yield a TRUNC instead of a DIFF from leader B.

          hanm Michael Han added a comment -

          Fangmin Lv, any plan to submit your RetainDB implementation? This is an important bug to fix.

          lvfangmin Fangmin Lv added a comment -

          Michael Han, we've just finished RetainDB and are testing it in our internal ensemble; we might submit the code for review next week.

          lvfangmin Fangmin Lv added a comment -

          The internal patch has been stabilized and tested for a long time; we rolled it out to one of our production environments last week. Joseph from our team will attach the patch here for review this week.

          hanm Michael Han added a comment -

          Thanks for the update, Fangmin Lv. Good to know the patch has been tested in a production environment!

          githubbot ASF GitHub Bot added a comment -

          Github user revans2 commented on the issue:

          https://github.com/apache/zookeeper/pull/310

          @lvfangmin any update on getting a pull request for the actual fix?

          githubbot ASF GitHub Bot added a comment -

          Github user lvfangmin commented on the issue:

          https://github.com/apache/zookeeper/pull/310

          @revans2 my teammate has been working on the fix, and he was planning to run it in prod for a while before sending out the diff. I'll sync with him today about the status.


            People

            • Assignee:
              lvfangmin Fangmin Lv
              Reporter:
              lvfangmin Fangmin Lv
            • Votes:
              0
              Watchers:
              6

              Dates

              • Created:
                Updated:

                Development