[ZOOKEEPER-4882] Data loss after restarting a node that experienced a temporary disk error and rejoined - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.8.4, 3.9.3
Fix Version/s: None
Component/s: server
Labels: None

Description

The cause is multifold:
1. The leader commits a proposal as soon as a quorum has acked it.
2. A node can commit a proposal to its in-memory database even if the
proposal has not been written to that node's disk.
3. After a disk error, the txn log can therefore lag behind the in-memory
database.

The above applies to both leader and follower. I have not verified the leader branch, so let's consider only the follower for now.

f4. A follower that experienced a temporary disk error will have a hole in
its txn log after it rejoins.
f5. Once restarted, that follower will lose the data. Worse, it can win an
election and propagate the data loss to the whole cluster.
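
The sequence in points 1-3 and f4-f5 can be illustrated with a toy model. This is only a sketch of the scenario as described here, not ZooKeeper code: `TxnLogHoleSketch`, `Follower`, `logProposal`, `commit`, and `restart` are hypothetical names, the list `txnLog` stands in for the on-disk txn log, and it assumes the leader reaches quorum for the second txn through other followers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the scenario; none of these names are ZooKeeper classes. */
final class TxnLogHoleSketch {

    /** A transaction: a zxid plus a single key/value write. */
    static final class Txn {
        final long zxid;
        final String key;
        final String value;

        Txn(long zxid, String key, String value) {
            this.zxid = zxid;
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical follower with an in-memory database and a txn log. */
    static final class Follower {
        final List<Txn> txnLog = new ArrayList<>();            // stands in for the on-disk txn log
        final Map<String, String> memoryDb = new TreeMap<>();  // in-memory database
        boolean diskFailing = false;                           // simulated temporary disk error

        /** Try to persist a proposal; returns false (no ack) while the disk is failing. */
        boolean logProposal(Txn txn) {
            if (diskFailing) {
                return false;
            }
            txnLog.add(txn);
            return true;
        }

        /** Apply a commit to memory without checking that the txn was logged locally. */
        void commit(Txn txn) {
            memoryDb.put(txn.key, txn.value);
        }

        /** Restart: the in-memory database is rebuilt from the txn log only. */
        Follower restart() {
            Follower restarted = new Follower();
            for (Txn txn : txnLog) {
                restarted.txnLog.add(txn);
                restarted.memoryDb.put(txn.key, txn.value);
            }
            return restarted;
        }

        /** Zxid of the last logged txn, or 0 if the log is empty. */
        long lastLoggedZxid() {
            return txnLog.isEmpty() ? 0L : txnLog.get(txnLog.size() - 1).zxid;
        }
    }

    /** Runs the scenario and returns the follower as it looks after a restart. */
    static Follower simulate() {
        Follower follower = new Follower();

        Txn t1 = new Txn(1L, "/a", "v1");
        follower.logProposal(t1);
        follower.commit(t1);

        // Temporary disk error: txn 2 is not logged here, but the leader
        // (assumed to have a quorum from other nodes) still sends the commit.
        follower.diskFailing = true;
        Txn t2 = new Txn(2L, "/b", "v2");
        follower.logProposal(t2);
        follower.commit(t2);

        // Disk recovers: txn 3 is logged and committed normally.
        follower.diskFailing = false;
        Txn t3 = new Txn(3L, "/c", "v3");
        follower.logProposal(t3);
        follower.commit(t3);

        return follower.restart();
    }

    public static void main(String[] args) {
        Follower restarted = simulate();
        System.out.println("data after restart: " + restarted.memoryDb);
        System.out.println("last logged zxid:   " + restarted.lastLoggedZxid());
    }
}
```

In this model the restarted follower holds `/a` and `/c` but not `/b`, while its last logged zxid is still 3. That combination is what makes f5 dangerous: the node looks up to date by zxid even though its log has a hole.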

I authored commits in my repo to expose this.

https://github.com/kezhuw/zookeeper/commits/data-loss-temporary-sync-disk-error/