Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.6.1
Fix Version/s: None
Component/s: None
Description
Hello,
We faced an issue where one of the Kafka brokers in the cluster failed with an exception and restarted:
[2022-04-13T09:51:44,563][ERROR][category=kafka.server.LogDirFailureChannel] Error while rolling log segment for prod_data_topic-7 in dir /var/opt/kafka/data/1
java.io.FileNotFoundException: /var/opt/kafka/data/1/prod_data_topic-7/00000000000026872377.index (No such file or directory)
	at java.base/java.io.RandomAccessFile.open0(Native Method)
	at java.base/java.io.RandomAccessFile.open(Unknown Source)
	at java.base/java.io.RandomAccessFile.<init>(Unknown Source)
	at java.base/java.io.RandomAccessFile.<init>(Unknown Source)
	at kafka.log.AbstractIndex.$anonfun$resize$1(AbstractIndex.scala:183)
	at kafka.log.AbstractIndex.resize(AbstractIndex.scala:176)
	at kafka.log.AbstractIndex.$anonfun$trimToValidSize$1(AbstractIndex.scala:242)
	at kafka.log.AbstractIndex.trimToValidSize(AbstractIndex.scala:242)
	at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:508)
	at kafka.log.Log.$anonfun$roll$8(Log.scala:1916)
	at kafka.log.Log.$anonfun$roll$2(Log.scala:1916)
	at kafka.log.Log.roll(Log.scala:2349)
	at kafka.log.Log.maybeRoll(Log.scala:1865)
	at kafka.log.Log.$anonfun$append$2(Log.scala:1169)
	at kafka.log.Log.append(Log.scala:2349)
	at kafka.log.Log.appendAsLeader(Log.scala:1019)
	at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:984)
	at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:972)
	at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$4(ReplicaManager.scala:883)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:273)
	at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
	at scala.collection.TraversableLike.map(TraversableLike.scala:273)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:266)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:871)
	at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:571)
	at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:605)
	at kafka.server.KafkaApis.handle(KafkaApis.scala:132)
	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:70)
	at java.base/java.lang.Thread.run(Unknown Source)
[2022-04-13T09:51:44,812][ERROR][category=kafka.log.LogManager] Shutdown broker because all log dirs in /var/opt/kafka/data/1 have failed
There is no additional useful information in the logs, just one warning before this error:
[2022-04-13T09:51:44,720][WARN][category=kafka.server.ReplicaManager] [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions __consumer_offsets-22,prod_data_topic-5,__consumer_offsets-30, .... prod_data_topic-0 and stopped moving logs for partitions because they are in the failed log directory /var/opt/kafka/data/1.
[2022-04-13T09:51:44,720][WARN][category=kafka.log.LogManager] Stopping serving logs in dir /var/opt/kafka/data/1
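For context on the exception itself: as far as we can tell, the failing frame (kafka.log.AbstractIndex.resize) reopens the segment's index file with a RandomAccessFile in "rw" mode. That mode can create a missing file, but it cannot create a file inside a missing directory, so the FileNotFoundException above is consistent with the whole partition directory disappearing while the segment was being rolled. A minimal sketch of this failure mode (hypothetical paths, not Kafka's actual code):

import java.io.{File, RandomAccessFile}

object MissingDirRepro extends App {
  // Hypothetical path, standing in for /var/opt/kafka/data/1/prod_data_topic-7
  val partitionDir = new File("/tmp/prod_data_topic-7")
  partitionDir.mkdirs()

  // Simulate the partition directory vanishing underneath the broker
  // (e.g. an external cleanup or a racing topic deletion):
  partitionDir.delete()

  // "rw" mode creates a missing file, but not a file inside a missing
  // directory, so this throws java.io.FileNotFoundException
  // ("No such file or directory"), matching the stack trace above.
  new RandomAccessFile(new File(partitionDir, "00000000000026872377.index"), "rw")
}

One known way for a partition directory to disappear mid-roll is the scenario in the linked KAFKA-15391 ("Delete topic may lead to directory offline"), although we did not knowingly delete any topics.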
The topic configuration is:
/opt/kafka $ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic prod_data_topic
Topic: prod_data_topic	PartitionCount: 12	ReplicationFactor: 3	Configs: min.insync.replicas=2,segment.bytes=1073741824,max.message.bytes=15728640,retention.bytes=4294967296
	Topic: prod_data_topic	Partition: 0	Leader: 3	Replicas: 3,1,2	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 1	Leader: 1	Replicas: 1,2,3	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 2	Leader: 2	Replicas: 2,3,1	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 3	Leader: 3	Replicas: 3,2,1	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 4	Leader: 1	Replicas: 1,3,2	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 5	Leader: 2	Replicas: 2,1,3	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 6	Leader: 3	Replicas: 3,2,1	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 7	Leader: 1	Replicas: 1,3,2	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 8	Leader: 2	Replicas: 2,1,3	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 9	Leader: 3	Replicas: 3,1,2	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 10	Leader: 1	Replicas: 1,2,3	Isr: 3,2,1
	Topic: prod_data_topic	Partition: 11	Leader: 2	Replicas: 2,3,1	Isr: 3,2,1
A day before this happened, we changed the "retention.bytes" broker config to 5368709120 (the previous value was 6442450944), but we are not sure whether it is related. Our current custom broker config is listed below, followed by an illustrative command for such a change:
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824
log.retention.bytes=4294967296
log.retention.hours=40
message.max.bytes=15728640
replica.lag.time.max.ms=30000
min.insync.replicas=2
delete.topic.enable=true
replica.fetch.max.bytes=15728640
default.replication.factor=3
num.replica.fetchers=2
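For reference, a cluster-wide retention change like the one described above would typically be applied with kafka-configs.sh. The command below is illustrative only (it assumes a dynamic default-broker update; the exact procedure we used may have differed):

/opt/kafka $ ./bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-default --alter \
  --add-config log.retention.bytes=5368709120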
Could you please help us investigate the reason for this failure? We have no ideas ourselves: there was no topic cleanup, file deletion, or any other maintenance procedure touching the disk.
Issue Links

- relates to KAFKA-15391: Delete topic may lead to directory offline (Resolved)