KAFKA-15339: Transient I/O error while appending records could halt the whole cluster


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: connect, producer
    • Labels: None

    Description

      We are running an integration test that starts an EmbeddedConnectCluster on the active 3.5 branch. Because of a transient disk error, we may encounter an IOException while appending records to one topic, as shown in the stack trace:

      [2023-08-13 16:53:51,016] ERROR Error while appending records to connect-config-topic-connect-cluster-0 in dir /tmp/EmbeddedKafkaCluster8003464883598783225 (org.apache.kafka.storage.internals.log.LogDirFailureChannel:61)
      java.io.IOException: 
              at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:92)
              at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
              at kafka.log.LogSegment.append(LogSegment.scala:161)
              at kafka.log.LocalLog.append(LocalLog.scala:436)
              at kafka.log.UnifiedLog.append(UnifiedLog.scala:853)
              at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:664)
              at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1281)
              at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1269)
              at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:977)
              at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
              at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
              at scala.collection.mutable.HashMap.map(HashMap.scala:35)
              at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:965)
              at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:623)
              at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:680)
              at kafka.server.KafkaApis.handle(KafkaApis.scala:180)
              at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:76)
              at java.lang.Thread.run(Thread.java:748) 
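
      For context, my understanding of the propagation is that the log layer catches this IOException and reports it to LogDirFailureChannel, which is what produces the "Error while appending records ..." line above and marks the whole log directory offline. The following is only a rough Java sketch of that handoff, not the actual Kafka code; appendToSegment is a hypothetical stand-in, and I am assuming the maybeAddOfflineLogDir(String, String, IOException) signature:

      import java.io.IOException;

      import org.apache.kafka.common.errors.KafkaStorageException;
      import org.apache.kafka.storage.internals.log.LogDirFailureChannel;

      // Illustrative sketch only: approximates how one IOException during an append
      // is converted into an offline log directory for the whole dir.
      class AppendFailureSketch {
          private final LogDirFailureChannel failureChannel;
          private final String parentDir;

          AppendFailureSketch(LogDirFailureChannel failureChannel, String parentDir) {
              this.failureChannel = failureChannel;
              this.parentDir = parentDir;
          }

          void append(Object records) {
              try {
                  appendToSegment(records); // hypothetical stand-in for the real segment append
              } catch (IOException e) {
                  // A single failed write marks the entire directory offline; every
                  // partition hosted under parentDir is then abandoned by this broker.
                  failureChannel.maybeAddOfflineLogDir(parentDir,
                          "Error while appending records in dir " + parentDir, e);
                  throw new KafkaStorageException(e);
              }
          }

          private void appendToSegment(Object records) throws IOException {
              // ... the actual write to the segment's file channel would happen here ...
          }
      }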

      However, merely because the append to one partition failed, the fetchers for all the other partitions are removed, the broker shuts down, and finally the embedded Connect cluster is killed as a whole:

      [2023-08-13 17:35:37,966] WARN Stopping serving logs in dir /tmp/EmbeddedKafkaCluster6777164631574762227 (kafka.log.LogManager:70)
      [2023-08-13 17:35:37,968] ERROR Shutdown broker because all log dirs in /tmp/EmbeddedKafkaCluster6777164631574762227 have failed (kafka.log.LogManager:143)
      [2023-08-13 17:35:37,968] WARN Abrupt service halt with code 1 and message null (org.apache.kafka.connect.util.clusters.EmbeddedConnectCluster:130)
      [2023-08-13 17:35:37,968] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler:135)
      org.apache.kafka.connect.util.clusters.UngracefulShutdownException: Abrupt service halt with code 1 and message null 
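
      The final halt comes from the log manager: once a directory is marked offline it stops serving every log in that directory, and when no configured log directory remains online it halts the JVM, which is what kills the embedded Connect cluster. A simplified sketch of that decision (the class and method names here are illustrative, not the real LogManager code; only Exit.halt(1) mirrors the actual behaviour):

      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      import org.apache.kafka.common.utils.Exit;

      // Illustrative sketch of the "all log dirs failed" shutdown decision.
      class LogDirFailureSketch {
          private final Set<String> liveLogDirs = ConcurrentHashMap.newKeySet();

          LogDirFailureSketch(Set<String> configuredLogDirs) {
              liveLogDirs.addAll(configuredLogDirs);
          }

          void handleLogDirFailure(String dir) {
              System.err.printf("Stopping serving logs in dir %s%n", dir);
              liveLogDirs.remove(dir);
              if (liveLogDirs.isEmpty()) {
                  // With a single log directory (as in the embedded cluster above),
                  // one transient I/O error is enough to reach this branch.
                  System.err.println("Shutdown broker because all log dirs have failed");
                  Exit.halt(1);
              }
          }
      }

      In the embedded Connect cluster there appears to be only one log directory, so the very first offline directory already satisfies the "all log dirs have failed" condition.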

      I am wondering if we could add a configurable retry around the root cause to tolerate such transient I/O faults, so that if a retry succeeds the embedded Connect cluster can keep operating. A rough sketch of the idea is below.
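
      For illustration, the retry could be a thin, bounded wrapper around the lowest-level write, only escalating to the existing log-dir-failure path once the retries are exhausted. This is a minimal sketch under assumed names; AppendAttempt, maxRetries and backoffMs are hypothetical, not existing Kafka APIs or configs:

      import java.io.IOException;

      // Minimal sketch of a configurable retry around the record append.
      final class RetryingAppend {
          @FunctionalInterface
          interface AppendAttempt {
              void run() throws IOException;
          }

          static void appendWithRetry(AppendAttempt attempt, int maxRetries, long backoffMs)
                  throws IOException, InterruptedException {
              IOException lastFailure = null;
              for (int i = 0; i <= maxRetries; i++) {
                  try {
                      attempt.run();           // e.g. the write to the segment's file channel
                      return;                  // success: the transient fault was tolerated
                  } catch (IOException e) {
                      lastFailure = e;
                      Thread.sleep(backoffMs); // simple fixed backoff between attempts
                  }
              }
              // Only after exhausting the retries would the current handling
              // (offline log dir, possible broker shutdown) kick in.
              throw lastFailure;
          }
      }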

      Any comments and suggestions would be appreciated.


          People

            Assignee: Unassigned
            Reporter: Haoze Wu (functioner)
            Votes: 0
            Watchers: 2
