Kafka / KAFKA-1032

Messages sent to the old leader will be lost on broker GC resulted failure

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels: None

      Description

      As pointed out by Swapnil, today when a broker is in a long GC pause, the controller marks it as failed and triggers the onBrokerFailure function to migrate leadership to other brokers. However, the controller does not send the broker a stopReplica request, even after a new leader has been elected for its partitions. The new leader therefore stops fetching from the old leader, while the old leader is not aware that it is no longer the leader. And since the old leader is not really dead, producers do not refresh their metadata immediately and continue sending messages to it. The old leader only learns that it is no longer the leader when the controller notifies it in the onBrokerStartup function, so messages sent between the time the new leader is elected and the time the old leader realizes it is no longer the leader are lost.
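      The loss window described above can be illustrated with a toy simulation. This is only a sketch and not Kafka code: the object StaleLeaderWindow, the method simulate, and its parameters are hypothetical names, and it assumes that a producer whose send is rejected refreshes its metadata and delivers that message to the new leader.

```scala
// Toy model of the stale-leader window. All names are hypothetical.
object StaleLeaderWindow {

  /** Simulates `steps` producer sends that happen after a new leader was
    * elected (step 0). The old leader learns about the change at step
    * `notifiedAt`. Returns (messages lost on the old leader,
    * messages that reached the new leader).
    */
  def simulate(steps: Int, notifiedAt: Int): (Vector[String], Vector[String]) = {
    var oldLeaderThinksItLeads = true
    var producerTarget = "old" // producer metadata is stale at first
    val lost = Vector.newBuilder[String]
    val kept = Vector.newBuilder[String]

    for (step <- 1 to steps) {
      // The controller notification reaches the old leader at this step.
      if (step >= notifiedAt) oldLeaderThinksItLeads = false

      val msg = s"m$step"
      if (producerTarget == "old" && oldLeaderThinksItLeads) {
        // Appended locally, but the new leader never fetches it.
        lost += msg
      } else {
        // Assumption: a rejected send makes the producer refresh its
        // metadata and deliver the message to the new leader.
        producerTarget = "new"
        kept += msg
      }
    }
    (lost.result(), kept.result())
  }
}
```

      In this model, the earlier the old leader is told to stop (a smaller notifiedAt), the fewer messages land in the lost set, which is the motivation for notifying the failed broker right after the new leader is elected.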

        Activity

        Transition                     Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Patch Available        21h 38m                 1                 Guozhang Wang   30/Aug/13 00:31
        Patch Available -> Resolved    243d 17h 26m            1                 Guozhang Wang   30/Apr/14 17:58
        Resolved -> Closed             13s                     1                 Guozhang Wang   30/Apr/14 17:59
        Tony Stevenson made changes -
        Workflow Apache Kafka Workflow [ 13051090 ] no-reopen-closed, patch-avail [ 13054533 ]
        Tony Stevenson made changes -
        Workflow no-reopen-closed, patch-avail [ 12813161 ] Apache Kafka Workflow [ 13051090 ]
        Guozhang Wang made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Guozhang Wang made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Guozhang Wang added a comment -

        Confirmed that the StopReplicaRequest is sent to the dead broker, hence closing this ticket.

        Jun Rao added a comment -

        It seems that the main thing this jira is trying to fix (sending StopReplicaRequest to the dead broker) is already checked in. Do we still need this jira?

        Jun Rao made changes -
        Fix Version/s 0.9.0 [ 12323928 ]
        Fix Version/s 0.8.1 [ 12322960 ]
        Guozhang Wang added a comment -

        I think this can wait until 0.9

        Neha Narkhede added a comment -

        Jun Rao, Guozhang Wang Do we want this in 0.8.1?

        Jun Rao made changes -
        Fix Version/s 0.8.1 [ 12322960 ]
        Guozhang Wang made changes -
        Attachment KAFKA-1032.v1.patch [ 12600687 ]
        Guozhang Wang made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Guozhang Wang added a comment -

        Note that we also need to delay the removal of dead brokers until after onBrokerFailure, and inside this function we need to wait until all messages have been sent by the sender thread.

        Neha Narkhede added a comment -

        It seems more natural to send a stop replica request to a failed broker.

        Swapnil Ghike added a comment -

        The problem is that the leader that went through the GC pause did not receive the become-follower request from the controller soon enough, so it kept acting as the leader for some time after the pause and appended new messages. These messages were lost when the affected broker became a follower.

        Another approach to fixing this would be to change OfflinePartitionLeaderSelector to send a LeaderAndIsrRequest to dead brokers. This would ensure that the old leader (if still alive) stops acting as a leader much sooner.

        Guozhang Wang added a comment -

        Proposed approach:

        1. Add a call to addStopReplicaRequestForBrokers with deletePartition = false to the handling of a replica state change to offline. Currently this state change is triggered only by onBrokerFailure and stopOldReplicasOfReassignedPartition.

        2. In shutdownBroker of KafkaController, remove the direct call

        brokerRequestBatch.addStopReplicaRequestForBrokers(Seq(id), topicAndPartition.topic, topicAndPartition.partition, deletePartition = false)
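        The idea behind step 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the attached patch: RequestBatch, ReplicaStateMachineSketch, StopReplica, and the two state objects are hypothetical stand-ins that only borrow the method name and parameter order from the call quoted above.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for replica states; not the real Kafka classes.
sealed trait ReplicaState
case object OnlineReplica extends ReplicaState
case object OfflineReplica extends ReplicaState

final case class StopReplica(topic: String, partition: Int, deletePartition: Boolean)

// Hypothetical request batch: collects requests per broker before sending.
class RequestBatch {
  private val pending = mutable.Map.empty[Int, Vector[StopReplica]]

  def addStopReplicaRequestForBrokers(brokerIds: Seq[Int], topic: String,
                                      partition: Int, deletePartition: Boolean): Unit =
    brokerIds.foreach { id =>
      pending(id) = pending.getOrElse(id, Vector.empty) :+
        StopReplica(topic, partition, deletePartition)
    }

  def requestsFor(brokerId: Int): Vector[StopReplica] =
    pending.getOrElse(brokerId, Vector.empty)
}

// Hypothetical state machine: every transition to offline queues a
// stop-replica request, so callers no longer need to add one themselves.
class ReplicaStateMachineSketch(batch: RequestBatch) {
  private val states = mutable.Map.empty[(String, Int, Int), ReplicaState]

  def handleStateChange(topic: String, partition: Int, replicaId: Int,
                        target: ReplicaState): Unit = {
    target match {
      case OfflineReplica =>
        // Tell the (possibly still running) broker to stop, keeping its data.
        batch.addStopReplicaRequestForBrokers(Seq(replicaId), topic, partition,
          deletePartition = false)
      case OnlineReplica => ()
    }
    states((topic, partition, replicaId)) = target
  }

  def stateOf(topic: String, partition: Int, replicaId: Int): Option[ReplicaState] =
    states.get((topic, partition, replicaId))
}
```

        With the request queued inside the state change itself, a caller such as shutdownBroker would not need its own direct call, which is what step 2 removes.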

        Guozhang Wang created issue -

          People

          • Assignee: Guozhang Wang
          • Reporter: Guozhang Wang
          • Votes: 0
          • Watchers: 4
