[ZOOKEEPER-2872] Interrupted snapshot sync causes data loss - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.10, 3.5.3, 3.6.0
Fix Version/s: 3.6.0
Component/s: server
Labels:
- pull-request-available

Description

Observers can permanently lose data from their local data tree while remaining members in good standing of the ensemble and continuing to serve client traffic. This happens when the following chain of events occurs.

1. The observer dies in epoch N from machine failure.
2. The observer comes back up in epoch N+1 and requests a snapshot sync to catch up.
3. The machine powers off after some txns have been logged but before the snapshot is synced to disk (depending on the OS, this can happen!).
4. The observer comes back a second time and replays its most recent snapshot (epoch <= N) as well as the txn logs (epoch N+1).
5. A diff sync is requested from the leader and the observer broadcasts availability.

In this scenario, any commits from epoch N that the observer did not receive before it died the first time will never reach the observer, because the diff sync only covers transactions after the last one in its logs (epoch N+1). No part of the ensemble will complain.
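The chain of events above can be replayed in a toy model. The sketch below is illustrative only and is not ZooKeeper code: the class `InterruptedSnapSyncDemo`, the method `missingAfterRecovery`, the choice of five commits in epoch N, and the representation of a data tree as a set of applied zxids are all assumptions made for the example. The only ZooKeeper fact it relies on is that a zxid carries the epoch in its high 32 bits and a counter in its low 32 bits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class InterruptedSnapSyncDemo {
    /** Packs an epoch and a per-epoch counter into one zxid (epoch in the high 32 bits). */
    static long zxid(long epoch, long counter) {
        return (epoch << 32) | counter;
    }

    /**
     * Replays steps 1-5 and returns the committed zxids the learner still lacks.
     * A "data tree" is modeled (hypothetically) as the set of zxids applied to it.
     */
    static List<Long> missingAfterRecovery(boolean snapshotFsynced) {
        final long n = 1; // hypothetical epoch N

        // Leader history: five commits in epoch N.
        TreeSet<Long> leader = new TreeSet<>();
        for (long c = 1; c <= 5; c++) {
            leader.add(zxid(n, c));
        }

        // Step 1: the learner dies in epoch N; its on-disk snapshot covers N:1..N:3 only.
        TreeSet<Long> disk = new TreeSet<>(leader.headSet(zxid(n, 3), true));

        // Step 2: in epoch N+1 the leader ships a full snapshot (N:1..N:5).
        TreeSet<Long> receivedSnapshot = new TreeSet<>(leader);

        // Step 3: new txns are logged and reach disk; the snapshot reaches disk only if fsynced.
        for (long c = 1; c <= 2; c++) {
            long z = zxid(n + 1, c);
            leader.add(z);
            disk.add(z);
        }
        if (snapshotFsynced) {
            disk.addAll(receivedSnapshot);
        }
        // Power loss here: only the contents of 'disk' survive.

        // Step 4: restart from the most recent on-disk snapshot plus the txn logs.
        TreeSet<Long> learner = new TreeSet<>(disk);

        // Step 5: diff sync delivers only what follows the learner's last zxid.
        long last = learner.last();
        learner.addAll(leader.tailSet(last, false));

        List<Long> missing = new ArrayList<>(leader);
        missing.removeAll(learner);
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("lost without fsync: " + missingAfterRecovery(false).size());
        System.out.println("lost with fsync:    " + missingAfterRecovery(true).size());
    }
}
```

In this model, the run without the fsync leaves the learner missing the two epoch-N commits it never saw (N:4 and N:5), while its last zxid looks fully up to date to the leader.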

This situation is not unique to observers and can happen to any learner. As a simple fix, fsync-ing the snapshots received from the leader will avoid the case of missing snapshots causing data loss.
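As a rough illustration of what fsync-ing a received snapshot means in Java, the sketch below writes the bytes and forces them to the storage device before returning. It is not the patch from the linked pull requests; the class `DurableSnapshotWriter`, the method `writeDurably`, and the file name are hypothetical, and only standard `java.nio` calls are used.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DurableSnapshotWriter {
    /**
     * Writes the snapshot bytes to 'target' and does not return until the
     * file content and metadata have been forced to the storage device.
     */
    static void writeDurably(Path target, byte[] snapshotBytes) throws IOException {
        try (FileChannel channel = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(snapshotBytes);
            while (buffer.hasRemaining()) {
                channel.write(buffer); // a single write may be partial, so loop
            }
            channel.force(true); // the fsync step: flush data and metadata to disk
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("snap-demo");
        Path snapshot = dir.resolve("snapshot.demo"); // hypothetical file name
        writeDurably(snapshot, "serialized data tree".getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + Files.size(snapshot) + " bytes to " + snapshot);
    }
}
```

The point of the design is ordering: if the learner only starts logging new txns after `writeDurably` returns, a power loss cannot leave txn logs from epoch N+1 on disk without the snapshot they depend on.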

Issue Links

fixes
- ZOOKEEPER-2310 Snapshot files must be synced to prevent inconsistency or data loss (Resolved)

is related to
- ZOOKEEPER-4433 Backport ZOOKEEPER-2872 for branch-3.5 (Interrupted snapshot sync causes data loss) (Resolved)

links to
- GitHub Pull Request #333
- GitHub Pull Request #1146

People

Assignee: Brian Nixon

Reporter: Brian Nixon

Votes: 0

Watchers: 7

Dates

Created: 10/Aug/17 18:07

Updated: 03/Jan/22 04:49

Resolved: 11/Feb/21 19:19

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 10m