Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
Description
The symptom is that the cluster reports under-replicated partitions and some replicas never seem to catch up. It turns out that the corresponding fetchers on those brokers were stopped shortly after the broker's restart. The broker had stopped the fetchers upon receiving stop-replica requests from the controller, and the controller had issued those requests while processing controlled shutdown requests from the same broker. However, those controlled shutdown requests were all sent before the broker restarted, and the controller processed them only afterwards. Here is the timeline:
- broker sends controlled shutdown message to the controller
- the controlled shutdown fails and the broker proceeds with an unclean shutdown
- the broker is restarted
- the controller processes the previously sent controlled shutdown messages
- the controller sends stop replica messages to the broker
- the broker shuts down the fetchers, although it has no intent to shut down again (see the sketch after this list)
- this leads to under-replicated partitions
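The linked KAFKA-7235 addresses this race by using brokerZkNodeVersion to prevent the broker from processing outdated controller requests. A minimal sketch in Java of that idea, with hypothetical names (BrokerEpochGuard, registrationEpoch) that are not Kafka's actual classes: the broker records the epoch of its current registration and rejects any controller request stamped with an older epoch.

/**
 * Minimal sketch (hypothetical names, not Kafka's actual code) of the
 * guard later introduced by KAFKA-7235.
 */
public final class BrokerEpochGuard {
    // Epoch of the broker's current ZK registration (e.g. the version of
    // its znode); bumped on every restart/re-registration.
    private volatile long registrationEpoch;

    public BrokerEpochGuard(long initialEpoch) {
        this.registrationEpoch = initialEpoch;
    }

    /** Called when the broker (re-)registers after a restart. */
    public void onReRegister(long newEpoch) {
        registrationEpoch = newEpoch;
    }

    /**
     * A stop-replica (or any controller) request carries the broker epoch
     * the controller observed when it queued the request. A request
     * stamped with an epoch older than the current registration was
     * addressed to a previous incarnation of this broker and is dropped.
     */
    public boolean shouldProcess(long requestBrokerEpoch) {
        return requestBrokerEpoch >= registrationEpoch;
    }
}

In the timeline above, the stop-replica requests were queued against the pre-restart incarnation of broker 19, so a guard like this would have rejected them after the restart instead of tearing down the new incarnation's fetchers.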
Example from logs:
broker19.com:/var/log/kafka$ grep "Retrying controlled shutdow\|unclean shutdown" server.log.2016-07-07.2
2016-07-07 15:58:10,818 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
2016-07-07 15:58:45,887 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
2016-07-07 15:59:20,927 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
2016-07-07 15:59:20,929 WARN server.KafkaServer: [Kafka Server 19], Proceeding to do an unclean shutdown as all the controlled shutdown attempts failed
broker19.com:/var/log/kafka$ head -1 server.log.2016-07-07.3
2016-07-07 16:00:23,191 INFO server.KafkaConfig: KafkaConfig values:
broker13.com:/var/log/kafka$ grep "Shutting down broker 19" controller.log.2016-07-07.1
2016-07-07 15:57:35,822 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
2016-07-07 16:02:45,526 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
2016-07-07 16:05:42,432 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
which resulted in many stop replica requests to broker 19:
broker13.com:/var/log/kafka$ grep "The stop replica request (delete = false) sent to broker 19 is" controller.log.2016-07-07.1 | tail -1
2016-07-07 16:06:02,374 DEBUG controller.ControllerBrokerRequestBatch: The stop replica request (delete = false) sent to broker 19 is [Topic=topic-xyz,Partition=6,Replica=19]
broker 19 processes them AFTER its restart:
broker19.com:/var/log/kafka$ grep "handling stop replica (delete=false) for partition .topic-xzy,3." state-change.log.2016-07-07.2
2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 handling stop replica (delete=false) for partition [topic-xzy,3]
2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 finished handling stop replica (delete=false) for partition [topic-xyz,3]
2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 handling stop replica (delete=false) for partition [topic-xyz,3]
2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 finished handling stop replica (delete=false) for partition [topic-xyz,3]
and removes the fetchers:
broker19.com:/var/log/kafka$ grep "Removed fetcher.*topic-xyz.3" server.log.2016-07-07.3 | tail -2
2016-07-07 16:06:00,154 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on broker 19] Removed fetcher for partitions [topic-xyz,3]
2016-07-07 16:06:00,155 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on broker 19] Removed fetcher for partitions [topic-xyz,3]
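For context on why removing the fetcher leaves the replica permanently behind: handling stop replica with delete=false boils down to stopping the partition's fetcher while keeping the log on disk. A rough sketch (hypothetical names such as SketchReplicaManager, not Kafka's actual classes) of that behavior:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Rough sketch (hypothetical names) of why stop replica (delete=false)
 * leaves a replica permanently behind: the fetcher is removed, but the
 * local log is kept, so the partition stays assigned yet never
 * replicates again.
 */
final class SketchReplicaManager {
    // partition -> fetcher thread; mirrors the role of ReplicaFetcherManager.
    private final Map<String, Thread> fetchers = new ConcurrentHashMap<>();

    void stopReplica(String topicPartition, boolean deletePartition) {
        // Stop replication from the leader for this partition
        // ("Removed fetcher for partitions [...]" in the log above).
        Thread fetcher = fetchers.remove(topicPartition);
        if (fetcher != null) {
            fetcher.interrupt();
        }
        if (deletePartition) {
            deleteLocalLog(topicPartition);
        }
        // With delete=false nothing else happens: the log stays and the
        // replica stays registered, but no fetcher restarts on its own,
        // so the partition remains under-replicated until a later
        // leader-and-isr request re-creates the fetcher.
    }

    private void deleteLocalLog(String topicPartition) { /* elided */ }
}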
Attachments
Issue Links
- is part of KAFKA-7235: Use brokerZkNodeVersion to prevent broker from processing outdated controller request (Resolved)