Affects Version/s: 1.1.1, 2.0.0
Fix Version/s: None
We are facing the following issue: some replicas cannot become in-sync. The distribution of in-sync replicas among topics looks random. For instance:
The files in segment TEST-7 are identical (same md5sum) on all 3 brokers. We also checked them with kafka.tools.DumpLogSegments: the messages are the same.
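The checksum comparison above can be reproduced with something like the following sketch. It assumes the segment files from each broker have been copied to one machine; the function names and broker labels are hypothetical and are not part of any Kafka tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def segment_files_match(copies):
    """Compare copies of one segment file taken from several brokers.

    `copies` maps a broker label (hypothetical, e.g. "broker-0") to the
    local path of that broker's copy. Returns (all_equal, digests).
    """
    digests = {label: md5_of_file(path) for label, path in copies.items()}
    all_equal = len(set(digests.values())) <= 1
    return all_equal, digests
```

Identical digests only show that the on-disk bytes match; they say nothing about why the follower is not reported as in-sync.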
We have a 3-broker cluster running Confluent Kafka 5.0.0 (i.e. Apache Kafka 2.0.0).
Each broker has the following configuration:
- initially the cluster was running Confluent version 3.2.1 (Kafka 0.10.2)
- we updated the Confluent image to 4.1.1 (Kafka 1.1.1) according to https://docs.confluent.io/4.1.1/installation/upgrade.html
- a few days later one of the Kafka brokers was restarted. Since then the cluster has been behaving strangely: broker 0 was often missing from the ISR.
We have RF=3 for all topics; most topics had only 2 in-sync replicas, while some had all 3.
Unfortunately, we cannot pinpoint the exact moment when this started.
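To see which broker is missing from the ISR and for how many partitions, the output of `kafka-topics --describe` can be summarised with a small script. This is a sketch only: it assumes the usual partition-line layout (`Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...`), and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches partition lines of `kafka-topics --describe`, for example
# (illustrative line, not taken from our cluster):
#   Topic: example-topic  Partition: 0  Leader: 1  Replicas: 0,1,2  Isr: 1,2
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+|none)\s+"
    r"Replicas:\s*(?P<replicas>[\d,]*)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def missing_from_isr(describe_output):
    """Map (topic, partition) to the sorted broker ids in Replicas but not in Isr."""
    result = {}
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if not match:
            continue  # topic summary lines and blank lines are skipped
        replicas = {int(b) for b in match["replicas"].split(",") if b}
        isr = {int(b) for b in match["isr"].split(",") if b}
        lagging = sorted(replicas - isr)
        if lagging:
            result[(match["topic"], int(match["partition"]))] = lagging
    return result


def count_by_broker(missing):
    """Count, per broker id, how many partitions it is missing from."""
    counts = Counter()
    for brokers in missing.values():
        counts.update(brokers)
    return dict(counts)
```

Running `kafka-topics --describe` with `--under-replicated-partitions` gives a similar list directly; the per-broker count just makes it easier to see whether a single broker is affected.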
Steps taken to try to fix the issue:
- restarted all 3 brokers in a rolling manner. Each time the cluster controller was restarted. After that, the issue moved from broker 0 to broker 1
- changed replica.lag.time.max.ms: 10s -> 30s
- changed num.replica.fetchers: 1 -> 4
- changed num.network.threads: 3 -> 8
- because the preferred replica was often not the leader, we ran kafka-preferred-replica-election for all topics. This was done a few times
- CP version was upgraded to 5.0.0 (Kafka 2.0.0)
- changed zookeeper.session.timeout.ms: 6000 -> 60000
- changed replica.fetch.wait.max.ms: 500 -> 5000
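For context on the replica.lag.time.max.ms change in the list above: as far as we understand, the leader treats a follower as out of sync when the follower has not caught up to the leader's log end for longer than this value. The sketch below is a simplified model of that rule, not the broker's code; the function and parameter names are hypothetical.

```python
def is_out_of_sync(now_ms, last_caught_up_ms, replica_lag_time_max_ms):
    """Simplified ISR rule (an assumption, not broker code).

    A follower counts as out of sync when the time since it last caught up
    to the leader's log end exceeds replica.lag.time.max.ms.
    """
    return now_ms - last_caught_up_ms > replica_lag_time_max_ms


# A follower that last caught up 25 s ago is dropped at 10 s but kept at 30 s.
dropped_at_10s = is_out_of_sync(25_000, 0, 10_000)
dropped_at_30s = is_out_of_sync(25_000, 0, 30_000)
```

Under this model, raising the limit only helps a follower that catches up from time to time; a follower that never reaches the leader's log end stays out of the ISR at any setting.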
Any ideas on how to fix this (other than restarting the brokers)?
Many thanks in advance!