  Kafka / KAFKA-1647

Replication offset checkpoints (high water marks) can be lost on hard kills and restarts


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.2.0
    • Fix Version/s: 0.8.2.0
    • Component/s: None

    Description

      We ran into this scenario recently in a production environment. It can happen when enough brokers in a cluster are taken down, i.e., when all replicas of some partition are down at the same time; a rolling bounce done properly should not cause this issue.

      Here is a sample scenario:

      • Cluster of three brokers: b0, b1, b2
      • Two partitions (of some topic) with replication factor two: p0, p1
      • Initial state:
        p0: leader = b0, ISR = {b0, b1}
        p1: leader = b1, ISR = {b0, b1}
      • Do a parallel hard-kill of all brokers
      • Bring up b2, so it is the new controller
      • b2 initializes its controller context and populates its leader/ISR cache (i.e., controllerContext.partitionLeadershipInfo) from ZooKeeper. The last known leaders are b0 (for p0) and b1 (for p1)
      • Bring up b1
      • The controller's onBrokerStartup procedure initiates a replica state change for all replicas on b1 to become online. As part of this replica state change it gets the last known leader and ISR and sends a LeaderAndIsrRequest to b1 (for p0 and p1). This LeaderAndIsrRequest contains: {p0: leader=b0; p1: leader=b1}, leaders={b1}. b0 is indicated as the leader of p0 but it is not included in the leaders field because b0 is down.
      • On receiving the LeaderAndIsrRequest, b1's replica manager will successfully make itself (b1) the leader for p1 (and create the local replica object corresponding to p1). It will, however, abort the become-follower transition for p0 because the designated leader b0 is offline. So it will not create the local replica object for p0.
      • It will then start the high water mark checkpoint thread. Since only p1 has a local replica object, only p1's high water mark will be checkpointed to disk. p0's previously written checkpoint, if any, will be lost.

      So, in summary, it seems we should always create the local replica object even if the online transition does not happen.
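
      For illustration only, here is a minimal, hypothetical sketch of the problematic code path (simplified stand-in names, not the actual ReplicaManager code): the high water mark checkpoint thread rewrites the checkpoint file from the set of local replica objects, so any partition whose become-leader/become-follower transition was aborted simply drops out of the file on the next checkpoint.

        import java.io.{File, PrintWriter}
        import scala.collection.mutable

        // Simplified stand-ins, not Kafka's actual classes.
        case class TopicAndPartition(topic: String, partition: Int)
        class Replica(val tp: TopicAndPartition, var highWatermark: Long)

        class ReplicaManagerSketch(checkpointFile: File) {
          // Only partitions whose transition completed get a local replica object.
          private val allReplicas = mutable.Map[TopicAndPartition, Replica]()

          def becomeFollower(tp: TopicAndPartition, designatedLeaderAlive: Boolean): Unit = {
            if (designatedLeaderAlive) {
              allReplicas.getOrElseUpdate(tp, new Replica(tp, highWatermark = 0L))
            }
            // else: transition aborted, no local replica object is created (the bug)
          }

          // Periodic checkpoint: the file is rewritten with only the partitions present
          // in allReplicas, so a previously persisted high water mark for any missing
          // partition (p0 in the scenario above) is silently lost.
          def checkpointHighWatermarks(): Unit = {
            val writer = new PrintWriter(checkpointFile)
            try {
              allReplicas.values.foreach { r =>
                writer.println(s"${r.tp.topic} ${r.tp.partition} ${r.highWatermark}")
              }
            } finally writer.close()
          }
        }

      In this sketch, the fix proposed above amounts to creating the local replica object unconditionally (i.e., also when the designated leader is offline), so its high water mark keeps getting checkpointed.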

      Possible symptoms of this bug include one or more of the following (we saw 2 and 3):

      1. Data loss: some data loss is expected on a hard kill, but this bug can cause loss of nearly all of a partition's data if the broker becomes a follower, truncates its log, and soon after happens to become the leader.
      2. High I/O on brokers that lose their high water mark and then (on a successful become-follower transition) truncate their log to zero and start catching up from the beginning.
      3. If the offsets topic is affected, then consumer offsets can get reset. This is because an offset load does not read past the high water mark, so if the high water mark is missing we load nothing (even though the offsets are still in the log).
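
      As a rough illustration of symptom 3 (hypothetical helper name, not the actual offset manager API): the offset load reads the offsets topic partition only up to the high water mark, so if the checkpointed value was lost and falls back to zero, nothing is loaded even though the offset messages are still in the log.

        // Hypothetical sketch: loading stops at the high water mark, so a lost
        // checkpoint (high water mark back at 0) means no offsets are loaded.
        def loadOffsetsFromLog(records: Seq[(Long, String)], highWatermark: Long): Seq[String] =
          records.collect { case (offset, value) if offset < highWatermark => value }

        // Example: the log still holds committed offset messages, but with
        // highWatermark = 0, loadOffsetsFromLog(Seq(0L -> "group,topic,0 -> 42"), 0L)
        // returns Seq(), so the group's offsets appear reset.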

      Attachments

        1. KAFKA-1647_2014-10-13_16:38:39.patch
          7 kB
          Jiangjie Qin
        2. KAFKA-1647_2014-10-18_00:26:51.patch
          9 kB
          Jiangjie Qin
        3. KAFKA-1647_2014-10-21_23:08:43.patch
          10 kB
          Jiangjie Qin
        4. KAFKA-1647_2014-10-27_17:19:07.patch
          12 kB
          Jiangjie Qin
        5. KAFKA-1647_2014-10-30_15:07:09.patch
          7 kB
          Jiangjie Qin
        6. KAFKA-1647_2014-10-30_15:10:22.patch
          8 kB
          Jiangjie Qin
        7. KAFKA-1647_2014-10-30_15:38:02.patch
          10 kB
          Jiangjie Qin
        8. KAFKA-1647_2014-10-30_15:50:33.patch
          12 kB
          Jiangjie Qin
        9. KAFKA-1647.patch
          4 kB
          Jiangjie Qin


          People

            Assignee: Jiangjie Qin (becket_qin)
            Reporter: Joel Jacob Koshy (jjkoshy)
            Votes: 0
            Watchers: 7
