KAFKA-7913: Kafka broker halts and messes up the whole cluster

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:
      kafka_2.12-2.1.0,
      openjdk version "11.0.1" 2018-10-16 LTS
      OpenJDK Runtime Environment 18.9 (build 11.0.1+13-LTS),
      CentOS Linux release 7.3.1611 (Core),
      linux 3.10.0-514.26.2.el7.x86_64

      Description

      We recently upgraded our cluster and are now running Kafka 2.1.0 on Java 11.

      For a while everything went fine, but then random brokers started to halt from time to time.

      When this happens, the broker still looks alive to the other brokers, but it stops receiving network traffic. The other brokers then throw an IOException:

      java.io.IOException: Connection to 36155 was disconnected before the response was read
              at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
              at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
              at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
              at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:241)
              at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
              at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
              at scala.Option.foreach(Option.scala:257)
              at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
              at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
              at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
      

      On the problematic broker, all logging stops: no errors, no exceptions, nothing.

      This also "breaks" the whole cluster: since clients and other brokers "think" that the broker is still alive, they keep trying to connect to it, and it seems that leader election leaves the problematic broker as a leader.
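
One way to tell "still accepts connections" apart from "still answers requests" is a small external probe. The sketch below is only an illustration under stated assumptions: it assumes the broker exposes a PLAINTEXT listener (no TLS or SASL), it hand-encodes a version 0 ApiVersions request as described in the Kafka protocol guide, and the names `build_api_versions_request` and `probe_broker` are hypothetical helpers, not part of Kafka or any client library.

```python
import socket
import struct

API_VERSIONS_KEY = 18  # Kafka protocol API key for ApiVersions


def build_api_versions_request(correlation_id, client_id="probe"):
    """Encode a v0 ApiVersions request (request header only, empty body)."""
    cid = client_id.encode("utf-8")
    # Header: api_key (int16), api_version (int16), correlation_id (int32),
    # client_id (int16 length followed by the bytes).
    header = struct.pack(">hhih", API_VERSIONS_KEY, 0, correlation_id, len(cid)) + cid
    # Every Kafka request is prefixed with its size as an int32.
    return struct.pack(">i", len(header)) + header


def _read_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes the connection early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def probe_broker(host, port, timeout=5.0, correlation_id=1):
    """Return 'responsive', 'no-response', 'unexpected-response' or 'unreachable'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"  # TCP connect failed or timed out
    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(build_api_versions_request(correlation_id))
            # Response starts with size (int32) and the echoed correlation_id (int32).
            _size, echoed = struct.unpack(">ii", _read_exact(sock, 8))
        except OSError:
            # Covers read timeouts and connections closed before a reply arrived.
            return "no-response"
    return "responsive" if echoed == correlation_id else "unexpected-response"
```

Calling `probe_broker("broker-host", 9092)` (host and port are placeholders) against a halted broker would presumably return "no-response" if the socket is still accepted but requests are never processed. That is an assumption about how the hang manifests, not something confirmed in this report.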

       

      I would be glad to provide further details if somebody could advise what to investigate the next time it happens.

              People

              • Assignee: Unassigned
              • Reporter: lazystone Andrej Urvantsev
              • Votes: 0
              • Watchers: 3
