Kafka / KAFKA-13720

A few topic partitions remain under-replicated after brokers lose connectivity to ZooKeeper


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.1
    • Fix Version/s: 3.1.0
    • Component/s: controller
    • Labels: None

    Description

      A few topic partitions remain under-replicated after the brokers lose connectivity to ZooKeeper.
      It only happens when the loss of ZooKeeper connectivity results in a change of the active controller. The issue does not occur every time, only randomly.
      The issue never occurs when the active controller does not change after the brokers lose connectivity to ZooKeeper.
      The following error message appears in the log file:

      [2022-02-28 04:01:20,217] WARN [Partition __consumer_offsets-4 broker=1] Controller failed to update ISR to PendingExpandIsr(isr=Set(1), newInSyncReplicaId=2) due to unexpected UNKNOWN_SERVER_ERROR. Retrying. (kafka.cluster.Partition)
      [2022-02-28 04:01:20,217] ERROR [broker-1-to-controller] Uncaught error in request completion: (org.apache.kafka.clients.NetworkClient)
      java.lang.IllegalStateException: Failed to enqueue `AlterIsr` request with state LeaderAndIsr(leader=1, leaderEpoch=2728, isr=List(1, 2), zkVersion=4719) for partition __consumer_offsets-4
      at kafka.cluster.Partition.sendAlterIsrRequest(Partition.scala:1403)
      at kafka.cluster.Partition.$anonfun$handleAlterIsrResponse$1(Partition.scala:1438)
      at kafka.cluster.Partition.handleAlterIsrResponse(Partition.scala:1417)
      at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1(Partition.scala:1398)
      at kafka.cluster.Partition.$anonfun$sendAlterIsrRequest$1$adapted(Partition.scala:1398)
      at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8(AlterIsrManager.scala:166)
      at kafka.server.AlterIsrManagerImpl.$anonfun$handleAlterIsrResponse$8$adapted(AlterIsrManager.scala:163)
      at scala.collection.immutable.List.foreach(List.scala:333)
      at kafka.server.AlterIsrManagerImpl.handleAlterIsrResponse(AlterIsrManager.scala:163)
      at kafka.server.AlterIsrManagerImpl.responseHandler$1(AlterIsrManager.scala:94)
      at kafka.server.AlterIsrManagerImpl.$anonfun$sendRequest$2(AlterIsrManager.scala:104)
      at kafka.server.BrokerToControllerRequestThread.handleResponse(BrokerToControllerChannelManagerImpl.scala:175)
      at kafka.server.BrokerToControllerRequestThread.$anonfun$generateRequests$1(BrokerToControllerChannelManagerImpl.scala:158)
      at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
      at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:586)
      at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:578)
      at kafka.common.InterBrokerSendThread.doWork(InterBrokerSendThread.scala:71)
      at kafka.server.BrokerToControllerRequestThread.doWork(BrokerToControllerChannelManagerImpl.scala:183)
      at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
       
      The under-replicated partition count goes back to zero after the controller broker is restarted, but this requires manual intervention.
      The expectation is that once the brokers reconnect to ZooKeeper, the cluster should return to a stable state with an under-replicated partition count of zero on its own, without any manual intervention.
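
      To confirm the symptom without restarting anything, the under-replicated partition count can be checked programmatically. The sketch below is a hypothetical monitoring snippet (not from this ticket) that uses the standard Kafka AdminClient API to count partitions whose ISR is smaller than their replica set; the class name and the bootstrap address "broker-1:9092" are placeholders.

      import java.util.Map;
      import java.util.Properties;
      import org.apache.kafka.clients.admin.Admin;
      import org.apache.kafka.clients.admin.AdminClientConfig;
      import org.apache.kafka.clients.admin.TopicDescription;

      public class UnderReplicatedCheck {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              // Placeholder bootstrap address; point this at any broker in the affected cluster.
              props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
              try (Admin admin = Admin.create(props)) {
                  // Describe every topic in the cluster.
                  Map<String, TopicDescription> topics =
                      admin.describeTopics(admin.listTopics().names().get()).all().get();
                  // A partition is under-replicated when its ISR is smaller than its replica set.
                  long underReplicated = topics.values().stream()
                      .flatMap(td -> td.partitions().stream())
                      .filter(p -> p.isr().size() < p.replicas().size())
                      .count();
                  System.out.println("Under-replicated partitions: " + underReplicated);
              }
          }
      }

      The same information is available from the command line via kafka-topics.sh --describe --under-replicated-partitions, or from the broker's UnderReplicatedPartitions JMX metric.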

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Dhirendra Singh (dhirendraks@gmail.com)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
