Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-7802

Connection to Broker Disconnected Taking Down the Whole Cluster

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 2.1.0
    • None
    • core
    • None

    Description

      We recently upgraded to 2.1.0. Since then, several times per day, we observe some brokers were disconnected when other brokers were trying to fetch the replicas. This issue took down the whole cluster, making all the producers and consumers not able to publish or consume messages. It could be quickly fixed by restarting the problematic broker.

      Here is an example of what we're seeing in the broker which was trying to send fetch request to the problematic one:

      2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
      2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) to node 3: java.io.IOException: Connection to 3 was disconnected before the response was read.
      2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[178])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, epoch=1599941))
      java.io.IOException: Connection to 3 was disconnected before the response was read
      at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
      at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
      at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
      at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
      at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
      at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
      at scala.Option.foreach(Option.scala:257)
      at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
      at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
      at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

       

       

       Below is the suspicious log of the problematic broker when the issue happened:

      2019-01-09 08:04:50.177 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 in group talon-instance1 has failed, rem
      oving it from the group
      2019-01-09 08:04:50.177 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-instance1 in state PreparingRebalance with old generation 27
      0 (__consumer_offsets-47) (reason: removing member consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 on heartbeat expiration)
      2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b in group Notifications.ASIA1546980352799 has failed, removing it from the group
      2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group Notifications.ASIA1546980352799 in state PreparingRebalance with old generation 1 (__consumer_offsets-44) (reason: removing member consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b on heartbeat expiration)
      2019-01-09 08:04:50.297 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group Notifications.ASIA1546980352799 with generation 2 is now empty (__consumer_offsets-44)
      2019-01-09 08:04:50.388 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-3-0a4c55c2-9f31-4e7a-b0d7-1f057dceb03d in group talon-instance1 has failed, removing it from the group
      2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 in group talon-kafka-vision has failed, removing it from the group
      2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-kafka-vision in state PreparingRebalance with old generation 9 (__consumer_offsets-26) (reason: removing member consumer-1-f7253f75-c626-47b1-842e-4eca3b0551c4 on heartbeat expiration)
      2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group talon-kafka-vision with generation 10 is now empty (__consumer_offsets-26)
      2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 in group talon-hades-instance1 has failed, removing it from the group
      2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to rebalance group talon-hades-instance1 in state PreparingRebalance with old generation 122 (__consumer_offsets-11) (reason: removing member consumer-2-5e7d051c-be6c-4893-bdaf-16ea180a54d9 on heartbeat expiration)
      2019-01-09 08:04:50.419 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group talon-hades-instance1 with generation 123 is now empty (__consumer_offsets-11)
      2019-01-09 08:04:50.422 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-4-a527e579-7a14-471b-b19d-ffec50074bb8 in group talon-instance1 has failed, removing it from the group
      2019-01-09 08:04:50.434 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-4-0c470e05-5e9a-4cae-a493-9854a6d0c8a7 in group talon-instance1 has failed, removing it from the group
      2019-01-09 08:04:50.514 [executor-Heartbeat] INFO k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member consumer-2-155ea6c8-c90f-4af6-b65e-138a151d77d9 in group talon-instance1 has failed, removing it from the group
      2019-01-09 08:04:55.297 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group Notifications.ASIA1546980352799: The group is rebalancing, so a rejoin is needed.
      2019-01-09 08:04:55.419 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group talon-kafka-vision: The group is rebalancing, so a rejoin is needed.
      2019-01-09 08:04:55.420 [executor-Produce] WARN k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Failed to write empty metadata for group talon-hades-instance1: The group is rebalancing, so a rejoin is needed.

       

      We also took the thread dump of the problematic broker (attached). We found all the kafka-request-handler were hanging and waiting for some locks, which seemed to be a resource leak there.

       

      The java version we are running is 11.0.1

      Attachments

        1. thread_dump.log
          95 kB
          Candice Wan
        2. BytesInput.png
          173 kB
          Lubos Hozzan

        Issue Links

          Activity

            People

              Unassigned Unassigned
              candicewan Candice Wan
              Votes:
              2 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: