KAFKA-4477

Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.10.1.0
    • Fix Version/s: 0.10.1.1
    • Component/s: core
    • Environment: RHEL7

      java version "1.8.0_66"
      Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
      Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

    Description

      We have encountered a critical issue that has recurred in different physical environments. We haven't worked out what is going on, but we do have a nasty workaround to keep the service alive.

      We have not had this issue on clusters still running 0.9.0.1.

      We have noticed a node randomly shrinking the ISRs for the partitions it leads down to just itself. Moments later we see other nodes reporting disconnects, followed finally by application issues, where producing to these partitions is blocked.

      It seems that only restarting the Kafka Java process resolves the issue.

      This has occurred multiple times, and according to all our network and machine monitoring the machine never left the network or experienced any other glitches.

      Below are logs seen during the issue.

      Node 7:
      [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 (kafka.cluster.Partition)

      All other nodes:
      [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 (kafka.server.ReplicaFetcherThread)
      java.io.IOException: Connection to 7 was disconnected before the response was read

      All clients:
      java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.

      After this occurs, we then see on the sick machine a rapidly growing number of sockets in CLOSE_WAIT and open file descriptors.
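
      For reference, a minimal way to observe this symptom is to count sockets in CLOSE_WAIT directly from /proc (state 0x08 in /proc/net/tcp and /proc/net/tcp6). The sketch below is illustrative only; any alert threshold or integration with monitoring is an assumption left to the operator.

      # Count TCP sockets in CLOSE_WAIT (state 0x08 in /proc/net/tcp*).
      CLOSE_WAIT = "08"

      def count_close_wait():
          total = 0
          for path in ("/proc/net/tcp", "/proc/net/tcp6"):
              try:
                  with open(path) as f:
                      next(f)  # skip the header line
                      for line in f:
                          if line.split()[3] == CLOSE_WAIT:  # column 4 is the socket state
                              total += 1
              except FileNotFoundError:
                  pass
          return total

      if __name__ == "__main__":
          print(count_close_wait(), "sockets in CLOSE_WAIT")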

      As a workaround to keep the service alive, we are putting in place an automated process that tails the server log and matches the regex below; when new_partitions contains only the node itself, we restart that node.

      "[(?P<time>.)] INFO Partition [.] on broker . Shrinking ISR for partition [.*] from (?P<old_partitions>.) to (?P<new_partitions>.+) (kafka.cluster.Partition)"

      Attachments

        1. 2016_12_15.zip
          1.44 MB
          Michael Andre Pearce
        2. 72_Server_Thread_Dump.txt
          62 kB
          Arpan
        3. 73_Server_Thread_Dump.txt
          62 kB
          Arpan
        4. 74_Server_Thread_Dump
          58 kB
          Arpan
        5. issue_node_1001_ext.log
          487 kB
          Tom DeVoe
        6. issue_node_1001.log
          555 kB
          Tom DeVoe
        7. issue_node_1002_ext.log
          193 kB
          Tom DeVoe
        8. issue_node_1002.log
          162 kB
          Tom DeVoe
        9. issue_node_1003_ext.log
          489 kB
          Tom DeVoe
        10. issue_node_1003.log
          507 kB
          Tom DeVoe
        11. kafka.jstack
          75 kB
          Michael Andre Pearce
        12. server_1_72server.log
          39 kB
          Arpan
        13. server_2_73_server.log
          41 kB
          Arpan
        14. server_3_74Server.log
          40 kB
          Arpan
        15. state_change_controller.tar.gz
          11 kB
          Tom DeVoe


          People

            Assignee: Apurva Mehta (apurva)
            Reporter: Michael Andre Pearce (michael.andre.pearce)
            Votes: 10
            Watchers: 35
