[KAFKA-1647] Replication offset checkpoints (high water marks) can be lost on hard kills and restarts - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.8.2.0
Fix Version/s: 0.8.2.0
Component/s: None
Labels:
- newbie++

Description

We ran into this scenario recently in a production environment. This can happen when enough brokers in a cluster are taken down. i.e., a rolling bounce done properly should not cause this issue. It can occur if all replicas for any partition are taken down.

Here is a sample scenario:

Cluster of three brokers: b0, b1, b2
Two partitions (of some topic) with replication factor two: p0, p1
Initial state:
p0: leader = b0, ISR = {b0, b1}
p1: leader = b1, ISR = {b0, b1}
Do a parallel hard-kill of all brokers
Bring up b2, so it is the new controller
b2 initializes its controller context and populates its leader/ISR cache (i.e., controllerContext.partitionLeadershipInfo) from zookeeper. The last known leaders are b0 (for p0) and b1 (for p2)
Bring up b1
The controller's onBrokerStartup procedure initiates a replica state change for all replicas on b1 to become online. As part of this replica state change it gets the last known leader and ISR and sends a LeaderAndIsrRequest to b1 (for p1 and p2). This LeaderAndIsr request contains: {{p0: leader=b0; p1: leader=b1;} leaders=b1}. b0 is indicated as the leader of p0 but it is not included in the leaders field because b0 is down.
On receiving the LeaderAndIsrRequest, b1's replica manager will successfully make itself (b1) the leader for p1 (and create the local replica object corresponding to p1). It will however abort the become follower transition for p0 because the designated leader b0 is offline. So it will not create the local replica object for p0.
It will then start the high water mark checkpoint thread. Since only p1 has a local replica object, only p1's high water mark will be checkpointed to disk. p0's previously written checkpoint if any will be lost.

So in summary it seems we should always create the local replica object even if the online transition does not happen.

Possible symptoms of the above bug could be one or more of the following (we saw 2 and 3):

Data loss; yes on a hard-kill data loss is expected, but this can actually cause loss of nearly all data if the broker becomes follower, truncates, and soon after happens to become leader.
High IO on brokers that lose their high water mark then subsequently (on a successful become follower transition) truncate their log to zero and start catching up from the beginning.
If the offsets topic is affected, then offsets can get reset. This is because during an offset load we don't read past the high water mark. So if a water mark is missing then we don't load anything (even if the offsets are there in the log).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

KAFKA-1647_2014-10-13_16:38:39.patch
13/Oct/14 23:38
7 kB
Jiangjie Qin
KAFKA-1647_2014-10-18_00:26:51.patch
18/Oct/14 07:26
9 kB
Jiangjie Qin
KAFKA-1647_2014-10-21_23:08:43.patch
22/Oct/14 06:08
10 kB
Jiangjie Qin
KAFKA-1647_2014-10-27_17:19:07.patch
28/Oct/14 00:19
12 kB
Jiangjie Qin
KAFKA-1647_2014-10-30_15:07:09.patch
30/Oct/14 22:07
7 kB
Jiangjie Qin
KAFKA-1647_2014-10-30_15:10:22.patch
30/Oct/14 22:10
8 kB
Jiangjie Qin
KAFKA-1647_2014-10-30_15:38:02.patch
30/Oct/14 22:38
10 kB
Jiangjie Qin
KAFKA-1647_2014-10-30_15:50:33.patch
30/Oct/14 22:50
12 kB
Jiangjie Qin
KAFKA-1647.patch
06/Oct/14 17:06
4 kB
Jiangjie Qin

Activity

People

Assignee:: Jiangjie Qin

Reporter:: Joel Jacob Koshy

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 23/Sep/14 17:35

Updated:: 30/Oct/14 23:33

Resolved:: 30/Oct/14 23:33