Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-3944

After the broker restart, fetchers stopped due to a delayed controlled shutdown message

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • None
    • core

    Description

      The symptom is that cluster reports under-replicated blocks and some replicas do not seem to catch up ever. It turns out that the corresponding fetchers in those brokers were stopped shortly after the broker's restart. The broker had stopped the fetcher upon receiving stop-replica request from the controller. The controller had issued those request upon processing controlled shutdown request form the same broker. However those requests were all sent before the broker restart but the controller processed them after. Here is the timeline:

      1. broker sends controlled shutdown message to controller
      2. the process fails and the broker proceeds with an unclean shutdown
      3. the broker is restated
      4. the controller processes the perviously sent controlled shutdown messages
      5. the controller sends stop replica messages to the broker
      6. the broker shuts down the fetchers, while it has no intent to shut down again
      7. this leads to under-replicated blocks

      Example from logs:

      broker19.com:/var/log/kafka$ grep "Retrying controlled shutdow\|unclean shutdown" server.log.2016-07-07.2 
      2016-07-07 15:58:10,818 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
      2016-07-07 15:58:45,887 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
      2016-07-07 15:59:20,927 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
      2016-07-07 15:59:20,929 WARN server.KafkaServer: [Kafka Server 19], Proceeding to do an unclean shutdown as all the controlled shutdown attempts failed
      broker19.com:/var/log/kafka$ head -1 server.log.2016-07-07.3
      2016-07-07 16:00:23,191 INFO server.KafkaConfig: KafkaConfig values: 
      
      broker13.com:/var/log/kafka$ grep "Shutting down broker 19" controller.log.2016-07-07.1 
      2016-07-07 15:57:35,822 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
      2016-07-07 16:02:45,526 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
      2016-07-07 16:05:42,432 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
      

      which resulted into many stop replica request to broker 19:

      broker13.com:/var/log/kafka$ grep "The stop replica request (delete = false) sent to broker 19 is" controller.log.2016-07-07.1 | tail -1
      2016-07-07 16:06:02,374 DEBUG controller.ControllerBrokerRequestBatch: The stop replica request (delete = false) sent to broker 19 is [Topic=topic-xyz,Partition=6,Replica=19]
      

      broker 19 processes them AFTER its restart:

      broker19.com:/var/log/kafka$ grep "handling stop replica (delete=false) for partition .topic-xzy,3." state-change.log.2016-07-07.2 
      2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 handling stop replica (delete=false) for partition [topic-xzy,3]
      2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 finished handling stop replica (delete=false) for partition [topic-xyz,3]
      2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 handling stop replica (delete=false) for partition [topic-xyz,3]
      2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 finished handling stop replica (delete=false) for partition [topic-xyz,3]
      

      and removes the fetchers:

      broker19.com:/var/log/kafka$ grep "Removed fetcher.*topic-xyz.3" server.log.2016-07-07.3 | tail -2
      2016-07-07 16:06:00,154 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on broker 19] Removed fetcher for partitions [topic-xyz,3]
      2016-07-07 16:06:00,155 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on broker 19] Removed fetcher for partitions [topic-xyz,3]
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              maysamyabandeh Maysam Yabandeh
              Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: