Kafka / KAFKA-9672

Dead brokers in ISR cause isr-expiration to fail with exception



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.1
    • Fix Version/s: 3.0.0
    • Component/s: core
    • Labels: None


      We're running Kafka 2.4 and facing a pretty strange situation.
      Let's say there were three brokers in the cluster: 0, 1, and 2. Then:
      1. Broker 3 was added.
      2. Partitions were reassigned from broker 0 to broker 3.
      3. Broker 0 was shut down (not gracefully) and removed from the cluster.
      4. We see the following state in ZooKeeper:

      ls /brokers/ids
      [1, 2, 3]
      get /brokers/topics/foo
      get /brokers/topics/foo/partitions/0/state

      This means that the dead broker 0 remains in the partition's ISR. A large share of the partitions in the cluster have this issue.
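To gauge how many partitions are affected, the check described above can be scripted. The following is a minimal sketch, not a tool from the Kafka distribution: it assumes the partition state znodes hold JSON with an `isr` array and that their contents have already been fetched (for example with the ZooKeeper shell). The function name `partitions_with_dead_isr` and the sample data are hypothetical and are not taken from the affected cluster.

```python
import json


def partitions_with_dead_isr(live_broker_ids, partition_states):
    """Map each partition state path to the ISR members that are not live.

    live_broker_ids: broker ids listed under /brokers/ids.
    partition_states: dict of znode path -> raw JSON string of the state znode.
    Only partitions with at least one dead ISR member are returned.
    """
    live = set(live_broker_ids)
    result = {}
    for path, raw_state in partition_states.items():
        isr = json.loads(raw_state).get("isr", [])
        dead = sorted(set(isr) - live)
        if dead:
            result[path] = dead
    return result


# Hypothetical sample data for illustration only.
live_brokers = [1, 2, 3]
states = {
    "/brokers/topics/foo/partitions/0/state":
        '{"controller_epoch":1,"leader":1,"version":1,"leader_epoch":5,"isr":[0,1,2]}',
    "/brokers/topics/foo/partitions/1/state":
        '{"controller_epoch":1,"leader":2,"version":1,"leader_epoch":3,"isr":[1,2,3]}',
}

print(partitions_with_dead_isr(live_brokers, states))
```

With this sample input, only partition 0 is reported, because its ISR still lists broker 0, which is no longer registered under /brokers/ids.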

      This is actually causing errors:

      Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
      org.apache.kafka.common.errors.ReplicaNotAvailableException: Replica with id 12 is not available on broker 17

      It means that the isr-expiration task is effectively not working any more.

      I suspect that this was introduced by this commit (line selected).

      Unfortunately, I haven't been able to reproduce this in isolation.

      Any hints about how to reproduce (so I can write a patch) or mitigate the issue on a running cluster are welcome.

      In general, I assume the problem would be fixed by not throwing ReplicaNotAvailableException for a dead (i.e. non-existent) broker, and instead considering it out of sync and removing it from the ISR.
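The proposed behaviour can be illustrated with a small model. This is only a sketch of the idea in Python, not Kafka's actual Scala code and not the patch that resolved this issue. The function `out_of_sync_replicas` and its parameters are hypothetical, and `max_lag_ms` is assumed to play the role of `replica.lag.time.max.ms`. An ISR member the leader has no state for is treated as out of sync instead of raising an exception.

```python
def out_of_sync_replicas(isr, last_caught_up_ms, leader_id, now_ms, max_lag_ms):
    """Return the ISR members that should be removed from the ISR.

    last_caught_up_ms maps replica id -> last caught-up timestamp (ms) for
    the replicas the leader tracks. A replica missing from that map (for
    example a dead, no longer assigned broker) is reported as out of sync
    rather than causing an error.
    """
    stale = []
    for replica_id in isr:
        if replica_id == leader_id:
            continue  # the leader is always in sync with itself
        caught_up = last_caught_up_ms.get(replica_id)
        if caught_up is None or now_ms - caught_up > max_lag_ms:
            stale.append(replica_id)
    return stale


# Hypothetical state: broker 0 is still in the ISR but no longer tracked.
tracked = {1: 9_000, 2: 9_500}
print(out_of_sync_replicas([0, 1, 2], tracked, leader_id=1,
                           now_ms=10_000, max_lag_ms=30_000))
```

In this model the untracked broker 0 is simply selected for removal, so one stale entry cannot stop the whole expiration pass.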






              Assignee: jagsancio Jose Armando Garcia Sancio
              Reporter: ivanyu Ivan Yurchenko