[KAFKA-3944] After the broker restart, fetchers stopped due to a delayed controlled shutdown message - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: core
Labels:
- reliability

Description

The symptom is that cluster reports under-replicated blocks and some replicas do not seem to catch up ever. It turns out that the corresponding fetchers in those brokers were stopped shortly after the broker's restart. The broker had stopped the fetcher upon receiving stop-replica request from the controller. The controller had issued those request upon processing controlled shutdown request form the same broker. However those requests were all sent before the broker restart but the controller processed them after. Here is the timeline:

broker sends controlled shutdown message to controller
the process fails and the broker proceeds with an unclean shutdown
the broker is restated
the controller processes the perviously sent controlled shutdown messages
the controller sends stop replica messages to the broker
the broker shuts down the fetchers, while it has no intent to shut down again
this leads to under-replicated blocks

Example from logs:

broker19.com:/var/log/kafka$ grep "Retrying controlled shutdow\|unclean shutdown" server.log.2016-07-07.2 
2016-07-07 15:58:10,818 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
2016-07-07 15:58:45,887 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
2016-07-07 15:59:20,927 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
2016-07-07 15:59:20,929 WARN server.KafkaServer: [Kafka Server 19], Proceeding to do an unclean shutdown as all the controlled shutdown attempts failed
broker19.com:/var/log/kafka$ head -1 server.log.2016-07-07.3
2016-07-07 16:00:23,191 INFO server.KafkaConfig: KafkaConfig values:

broker13.com:/var/log/kafka$ grep "Shutting down broker 19" controller.log.2016-07-07.1 
2016-07-07 15:57:35,822 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
2016-07-07 16:02:45,526 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
2016-07-07 16:05:42,432 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19

which resulted into many stop replica request to broker 19:

broker13.com:/var/log/kafka$ grep "The stop replica request (delete = false) sent to broker 19 is" controller.log.2016-07-07.1 | tail -1
2016-07-07 16:06:02,374 DEBUG controller.ControllerBrokerRequestBatch: The stop replica request (delete = false) sent to broker 19 is [Topic=topic-xyz,Partition=6,Replica=19]

broker 19 processes them AFTER its restart:

broker19.com:/var/log/kafka$ grep "handling stop replica (delete=false) for partition .topic-xzy,3." state-change.log.2016-07-07.2 
2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 handling stop replica (delete=false) for partition [topic-xzy,3]
2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 finished handling stop replica (delete=false) for partition [topic-xyz,3]
2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 handling stop replica (delete=false) for partition [topic-xyz,3]
2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 finished handling stop replica (delete=false) for partition [topic-xyz,3]

and removes the fetchers:

broker19.com:/var/log/kafka$ grep "Removed fetcher.*topic-xyz.3" server.log.2016-07-07.3 | tail -2
2016-07-07 16:06:00,154 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on broker 19] Removed fetcher for partitions [topic-xyz,3]
2016-07-07 16:06:00,155 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on broker 19] Removed fetcher for partitions [topic-xyz,3]

Attachments

Issue Links

is part of

KAFKA-7235 Use brokerZkNodeVersion to prevent broker from processing outdated controller request

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Maysam Yabandeh

Votes:: 1 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 09/Jul/16 23:21

Updated:: 30/Dec/19 22:21

Resolved:: 30/Dec/19 22:21