Kafka
  1. Kafka
  2. KAFKA-1134

onControllerFailover function should be synchronized with other functions

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0, 0.8.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Otherwise race conditions could happen. For example, handleNewSession will close all sockets with brokers while the handleStateChange in onControllerFailover tries to send requests to them.

      1. KAFKA-1134.patch
        3 kB
        Guozhang Wang
      2. KAFKA-1134_2013-12-05_11:13:33.patch
        3 kB
        Guozhang Wang

        Activity

        Hide
        Neha Narkhede added a comment -

        Thanks for the patch. Committed to trunk

        Show
        Neha Narkhede added a comment - Thanks for the patch. Committed to trunk
        Hide
        Guozhang Wang added a comment -

        After checking the stack trace again, now I think the problem is that

        1) In KafkaController.handleNewSession

        controllerContext.controllerLock synchronized {
        Utils.unregisterMBean(KafkaController.MBeanName)
        partitionStateMachine.shutdown()
        replicaStateMachine.shutdown()
        if(controllerContext.controllerChannelManager != null)

        { controllerContext.controllerChannelManager.shutdown() controllerContext.controllerChannelManager = null }

        controllerElector.elect
        }

        elect function is called directly after controllerChannelManager.shutdown and is lock covered by controllerContext.controllerLock, however from the logs. elect is not immediately called since addpartition listener gets triggered due to ZK expiration (known issue similar as KAFKA-1143) and which are covered by the same lock:

        2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-587-send-thread], Stopped
        2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-587-send-thread], Shutdown completed
        2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Shutting down
        2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Stopped
        2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Shutdown completed
        2013/11/14 00:00:24.603 [ReplicaStateMachine$BrokerChangeListener] [BrokerChangeListener on Controller 583]: Broker change listener fired for path /brokers/ids with children 583,575,585,587,579,589
        2013/11/14 00:00:24.605 [ReplicaStateMachine$BrokerChangeListener] [BrokerChangeListener on Controller 583]: Broker change listener fired for path /brokers/ids with children 583,575,585,587,579,589
        2013/11/14 00:00:24.614 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: Add Partition triggered { "partitions":

        { "0":[ 577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }

        , "version":1 } for path /brokers/topics/databus2-relay-log_event
        2013/11/14 00:00:24.616 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: New partitions to be added [Map()]
        2013/11/14 00:00:24.616 [KafkaController] [Controller 583]: New partition creation callback for
        2013/11/14 00:00:24.618 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: Add Partition triggered { "partitions":

        { "0":[ 577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }

        , "version":1 } for path /brokers/topics/databus2-relay-log_event

        ----------------

        Without other logging info I cannot deduce any further, so I propose in this jira we just improve the logging info for better debugging if this issue comes up in the future.

        Show
        Guozhang Wang added a comment - After checking the stack trace again, now I think the problem is that 1) In KafkaController.handleNewSession controllerContext.controllerLock synchronized { Utils.unregisterMBean(KafkaController.MBeanName) partitionStateMachine.shutdown() replicaStateMachine.shutdown() if(controllerContext.controllerChannelManager != null) { controllerContext.controllerChannelManager.shutdown() controllerContext.controllerChannelManager = null } controllerElector.elect } elect function is called directly after controllerChannelManager.shutdown and is lock covered by controllerContext.controllerLock, however from the logs. elect is not immediately called since addpartition listener gets triggered due to ZK expiration (known issue similar as KAFKA-1143 ) and which are covered by the same lock: 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-587-send-thread] , Stopped 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-587-send-thread] , Shutdown completed 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread] , Shutting down 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread] , Stopped 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread] , Shutdown completed 2013/11/14 00:00:24.603 [ReplicaStateMachine$BrokerChangeListener] [BrokerChangeListener on Controller 583] : Broker change listener fired for path /brokers/ids with children 583,575,585,587,579,589 2013/11/14 00:00:24.605 [ReplicaStateMachine$BrokerChangeListener] [BrokerChangeListener on Controller 583] : Broker change listener fired for path /brokers/ids with children 583,575,585,587,579,589 2013/11/14 00:00:24.614 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583] : Add Partition triggered { "partitions": { "0":[ 577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] } , "version":1 } for path /brokers/topics/databus2-relay-log_event 2013/11/14 00:00:24.616 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583] : New partitions to be added [Map()] 2013/11/14 00:00:24.616 [KafkaController] [Controller 583] : New partition creation callback for 2013/11/14 00:00:24.618 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583] : Add Partition triggered { "partitions": { "0":[ 577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] } , "version":1 } for path /brokers/topics/databus2-relay-log_event ---------------- Without other logging info I cannot deduce any further, so I propose in this jira we just improve the logging info for better debugging if this issue comes up in the future.
        Hide
        Guozhang Wang added a comment -

        Updated reviewboard https://reviews.apache.org/r/15953/
        against branch origin/trunk

        Show
        Guozhang Wang added a comment - Updated reviewboard https://reviews.apache.org/r/15953/ against branch origin/trunk
        Hide
        Neha Narkhede added a comment -

        Guozhang Wang Do you happen to have a stack trace of the problem you have described?

        Show
        Neha Narkhede added a comment - Guozhang Wang Do you happen to have a stack trace of the problem you have described?
        Hide
        Guozhang Wang added a comment -

        Created reviewboard https://reviews.apache.org/r/15953/
        against branch origin/trunk

        Show
        Guozhang Wang added a comment - Created reviewboard https://reviews.apache.org/r/15953/ against branch origin/trunk

          People

          • Assignee:
            Guozhang Wang
            Reporter:
            Guozhang Wang
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development