KAFKA-1447

Controlled shutdown deadlock when trying to send state updates

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: controller
    • Labels: newbie++

      Description

      We're seeing controlled shutdown get stuck indefinitely while trying to send out state change messages to the other brokers:

      [2014-05-03 04:01:30,580] INFO [Socket Server on Broker 4], Shutdown completed (kafka.network.SocketServer)
      [2014-05-03 04:01:30,581] INFO [Kafka Request Handler on Broker 4], shutting down (kafka.server.KafkaRequestHandlerPool)

      with the request handler thread stuck at:

      "kafka-request-handler-12" daemon prio=10 tid=0x00007f1f04a66800 nid=0x6e79 waiting on condition [0x00007f1ad5767000]
      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for <0x000000078e91dc20> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
      at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
      at kafka.controller.ControllerChannelManager.sendRequest(ControllerChannelManager.scala:57)
      - locked <0x000000078e91dc38> (a java.lang.Object)
      at kafka.controller.KafkaController.sendRequest(KafkaController.scala:655)
      at kafka.controller.ControllerBrokerRequestBatch$$anonfun$sendRequestsToBrokers$2.apply(ControllerChannelManager.scala:298)
      at kafka.controller.ControllerBrokerRequestBatch$$anonfun$sendRequestsToBrokers$2.apply(ControllerChannelManager.scala:290)
      at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
      at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
      at scala.collection.Iterator$class.foreach(Iterator.scala:772)
      at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
      at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
      at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
      at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
      at kafka.controller.ControllerBrokerRequestBatch.sendRequestsToBrokers(ControllerChannelManager.scala:290)
      at kafka.controller.ReplicaStateMachine.handleStateChanges(ReplicaStateMachine.scala:97)
      at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply(KafkaController.scala:269)
      at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply(KafkaController.scala:253)
      at scala.Option.foreach(Option.scala:197)
      at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply$mcV$sp(KafkaController.scala:253)
      at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply(KafkaController.scala:253)
      at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply(KafkaController.scala:253)
      at kafka.utils.Utils$.inLock(Utils.scala:538)
      at kafka.controller.KafkaController$$anonfun$shutdownBroker$3.apply(KafkaController.scala:252)
      at kafka.controller.KafkaController$$anonfun$shutdownBroker$3.apply(KafkaController.scala:249)
      at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:130)
      at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:275)
      at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:275)
      at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:275)
      at kafka.controller.KafkaController.shutdownBroker(KafkaController.scala:249)
      - locked <0x000000078b495af0> (a java.lang.Object)
      at kafka.server.KafkaApis.handleControlledShutdownRequest(KafkaApis.scala:264)
      at kafka.server.KafkaApis.handle(KafkaApis.scala:192)
      at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
      at java.lang.Thread.run(Thread.java:722)
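
      The parked put corresponds to a bounded per-broker request queue (controller.message.queue.size, default 10) that has filled up with nothing draining it. Below is a minimal sketch of just that parking behavior, with made-up names rather than Kafka's actual classes:

      import java.util.concurrent.LinkedBlockingQueue

      // Illustrative only: a bounded queue with no consumer reproduces the
      // WAITING (parking) state shown in the thread dump above.
      object PutParks {
        def main(args: Array[String]): Unit = {
          val stateChangeQueue = new LinkedBlockingQueue[String](10) // 0.8.0 default size
          for (i <- 1 to 11) {
            println("put " + i)
            stateChangeQueue.put("request-" + i) // the 11th put parks forever
          }
          println("never reached")
        }
      }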

        Activity

        Sam Meder created issue -

        Guozhang Wang added a comment -

        Hi Sam, which version are you using?

        Sam Meder added a comment -

        0.8.0

        Guozhang Wang added a comment -

        This may be due to a known issue in 0.8.0; could you try with 0.8.1.1?

        Sam Meder added a comment -

        We'll be rolling out 0.8.1.1 soon, but since we can't reproduce this easily it is going to be hard to validate. I did look through the Jira issues related to controlled shutdown to see whether this had already been addressed, and didn't see any that seemed to match this situation. I'll report back once we have run with 0.8.1.1 for a while.

        Joel Koshy added a comment -

        Do you still have the preceding log? It would show whether controlled shutdown actually succeeded or ran out of retries.
        Was broker 4 the controller at the time it was shut down? It could be that it did not finish sending all the state change requests to itself, and never will, because the socket server has been shut down.
        Also, do you have the full stack trace?

        Sam Meder added a comment -

        I don't have those logs anymore; I should have grabbed all of them - my bad.

        I do believe broker 4 was the controller at the time, and yes, my guess is that it had something to do with sending requests to itself.

        That is the full stack trace (from the thread dump); there were no stack traces in the logs.

        Rudolf Šíma added a comment - edited

        The bug still seems to be present in 0.8.2. We ran into the issue when bouncing 18 brokers at once with controlled shutdown enabled, which led to this kind of deadlock. As a workaround, we have increased controller.message.queue.size to 10000 (the default is 10). Are there any pitfalls to using large controller message queue sizes?

        Neha Narkhede made changes -
        Priority: Major → Critical
        Neha Narkhede made changes -
        Labels: newbie++
        Neha Narkhede made changes -
        Assignee: Neha Narkhede [ nehanarkhede ]

        Neha Narkhede added a comment -

        Rudolf Šíma, it would help immensely if you could share the controller logs and the entire thread dump when you observe this issue.

        Jun Rao added a comment -

        Rudolf,

        Controlled shutdown is designed for rolling bounces.

        Jay Kreps added a comment -

        Does this problem still exist?

        Jiangjie Qin added a comment -

        I think KAFKA-1305 solved this issue.

          People

          • Assignee: Unassigned
          • Reporter: Sam Meder
          • Votes: 1
          • Watchers: 8
