Kafka / KAFKA-6565

Sudden unresponsiveness from broker + file descriptor leak


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 0.10.1.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

    Description

      We are running 3 Kafka clusters (v0.10.1.0) on AWS, 5 brokers each (plus 5 ZooKeeper nodes). In the past week, about 2-3 times a day (on all clusters, with no specific broker affected), we encounter a situation in which a single broker stops responding to all requests from clients and other brokers. At that moment all produced messages fail, and the other brokers start writing the following to their logs repeatedly:

      [2018-02-15 17:56:08,949] WARN [ReplicaFetcherThread-0-878], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@772ad40c (kafka.server.ReplicaFetcherThread)
      java.io.IOException: Connection to 878 was disconnected before the response was read
      at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
      at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
      at scala.Option.foreach(Option.scala:257)
      at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
      at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
      at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
      at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
      at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
      at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
      at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
      at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
      at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
      at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
      at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

       

      Producer logs (examples):

       2018-02-13 21:30:51:010 thread=[kafka-producer-network-thread | catalog-index-0] WARN  Got error produce response with correlation id 8147 on topic-partition catalog-1, retrying (49 attempts left). Error: REQUEST_TIMED_OUT

      2018-02-13 21:31:06:221 thread=[kafka-producer-network-thread | catalog-index-0] WARN  Got error produce response with correlation id 8166 on topic-partition catalog-index-18, retrying (48 attempts left). Error: NOT_LEADER_FOR_PARTITION
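
      For reference, the "49/48 attempts left" counters above suggest a producer retry budget of 50; below is a minimal Java sketch of a producer configuration that would emit these warnings when a broker stops responding. The broker addresses, topic name and timeout values are placeholders for illustration, not our actual configuration.

      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerConfig;
      import org.apache.kafka.clients.producer.ProducerRecord;

      public class CatalogProducerSketch {
          public static void main(String[] args) {
              Properties props = new Properties();
              // Placeholder bootstrap servers, not our real hosts.
              props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092");
              props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                        "org.apache.kafka.common.serialization.StringSerializer");
              props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                        "org.apache.kafka.common.serialization.StringSerializer");
              // A retry budget of 50 matches the "49/48 attempts left" warnings above.
              props.put(ProducerConfig.RETRIES_CONFIG, 50);
              // Requests sent to the unresponsive broker time out and are retried,
              // producing the REQUEST_TIMED_OUT warnings above.
              props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);

              try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                  producer.send(new ProducerRecord<>("catalog-index", "key", "value"));
              }
          }
      }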

       

      Once this happens, the number of open file descriptors on the affected broker starts increasing at a high rate until it reaches the configured limit.
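
      To quantify the leak the next time it happens, the sketch below polls the broker's open file descriptor count over JMX (the standard java.lang:type=OperatingSystem MBean). It assumes remote JMX is enabled on the broker; the host and port are placeholders.

      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class FdWatcher {
          public static void main(String[] args) throws Exception {
              // Placeholder host/port; requires the broker JVM to be started with remote JMX enabled.
              JMXServiceURL url = new JMXServiceURL(
                      "service:jmx:rmi:///jndi/rmi://broker-878:9999/jmxrmi");
              try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                  MBeanServerConnection mbs = connector.getMBeanServerConnection();
                  ObjectName os = new ObjectName("java.lang:type=OperatingSystem");
                  while (true) {
                      // Standard JVM attributes exposed on Unix platforms.
                      long open = (Long) mbs.getAttribute(os, "OpenFileDescriptorCount");
                      long max = (Long) mbs.getAttribute(os, "MaxFileDescriptorCount");
                      System.out.printf("open fds: %d / %d%n", open, max);
                      Thread.sleep(10_000L); // sample every 10 seconds
                  }
              }
          }
      }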

      There are no relevant INFO-level logs on the affected broker; we will try to gather DEBUG logs and attach them to this issue the next time it happens.

      The big issue here is that although the broker is unresponsive, it stays in the cluster and its ZooKeeper node is not removed, so no leader re-election or rebalance takes place; all clients keep failing until the broker is shut down manually.
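
      For completeness, this is roughly how we verify that the stuck broker's ephemeral registration is still present in ZooKeeper, which is what keeps it "in the cluster" and prevents re-election. The connect string is a placeholder; broker id 878 is the one from the fetcher log above.

      import java.nio.charset.StandardCharsets;
      import java.util.List;
      import org.apache.zookeeper.ZooKeeper;

      public class BrokerRegistrationCheck {
          public static void main(String[] args) throws Exception {
              // Placeholder connect string; broker id 878 is taken from the fetcher log above.
              ZooKeeper zk = new ZooKeeper("zk-1:2181,zk-2:2181,zk-3:2181", 10_000, event -> { });
              try {
                  List<String> ids = zk.getChildren("/brokers/ids", false);
                  System.out.println("registered broker ids: " + ids);
                  // If the stuck broker's ephemeral node is still present, the controller
                  // considers it alive and no leader re-election takes place.
                  byte[] data = zk.getData("/brokers/ids/878", false, null);
                  System.out.println(new String(data, StandardCharsets.UTF_8));
              } finally {
                  zk.close();
              }
          }
      }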

       

            People

              Assignee: Unassigned
              Reporter: Noam Berman (berman)
