Kafka / KAFKA-16997

Do not stop Kafka when a partition directory cannot be deleted


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.6.2
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

    Description

      Context: Our project creates many partitions. Even after their segments are deleted, the partitions themselves remain, and we ended up with so many of them that Kafka crashed because of the number of open files. We therefore want to delete those partitions regularly, but Kafka stops while doing so.

       

      The issue: after some investigation we found that the deletion process sometimes only logs a warning when it cannot delete a log file:

      [2024-06-17 15:52:39,590] WARN Failed atomic move of /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex to /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex.deleted retrying with a non-atomic move (org.apache.kafka.common.utils.Utils)
      java.nio.file.NoSuchFileException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex -> /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex.deleted
      	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
      	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
      	at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:416)
      	at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:266)
      	at java.base/java.nio.file.Files.move(Files.java:1432)
      	at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:980)
      	at org.apache.kafka.storage.internals.log.LazyIndex$IndexFile.renameTo(LazyIndex.java:80)
      	at org.apache.kafka.storage.internals.log.LazyIndex.renameTo(LazyIndex.java:202)
      	at org.apache.kafka.storage.internals.log.LogSegment.changeFileSuffixes(LogSegment.java:666)
      	at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$1(LocalLog.scala:912)
      	at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$1$adapted(LocalLog.scala:910)
      	at scala.collection.immutable.List.foreach(List.scala:431)
      	at kafka.log.LocalLog$.deleteSegmentFiles(LocalLog.scala:910)
      	at kafka.log.LocalLog.removeAndDeleteSegments(LocalLog.scala:289) 

      Kafka then simply continues. But when it fails to delete a directory, it marks the replica as failed and then stops Kafka if that is the only replica available (which is our case):

      [2024-06-17 15:52:39,637] ERROR Error while deleting dir for 69747657-f49d-453f-9fa2-4d4369199699-0 in dir /tmp/kafka-logs-mnt/kafka-no-docker (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
      java.nio.file.DirectoryNotEmptyException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete
      	at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246)
      	at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
      	at java.base/java.nio.file.Files.delete(Files.java:1152)
      	at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:923)
      	at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:901)
      	at java.base/java.nio.file.Files.walkFileTree(Files.java:2828)
      	at java.base/java.nio.file.Files.walkFileTree(Files.java:2882)
      	at org.apache.kafka.common.utils.Utils.delete(Utils.java:901)
      	at kafka.log.LocalLog.$anonfun$deleteEmptyDir$2(LocalLog.scala:243)
      	at kafka.log.LocalLog.deleteEmptyDir(LocalLog.scala:709)
      	at kafka.log.UnifiedLog.$anonfun$delete$2(UnifiedLog.scala:1734)
      	at kafka.log.UnifiedLog.delete(UnifiedLog.scala:1911)
      	at kafka.log.LogManager.deleteLogs(LogManager.scala:1152)
      	at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:1166)
      	at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      	at java.base/java.lang.Thread.run(Thread.java:833)
      [2024-06-17 15:52:39,640] WARN [ReplicaManager broker=0] Stopping serving replicas in dir /tmp/kafka-logs-mnt/kafka-no-docker (kafka.server.ReplicaManager)
      [2024-06-17 15:52:39,640] INFO [LocalLog partition=a11f3352-56fc-4d00-bdf8-f5fee33391f6-0, dir=/tmp/kafka-logs-mnt/kafka-no-docker] Deleting segment files LogSegment(baseOffset=0, size=861, lastModifiedTime=0, largestRecordTimestamp=1718632120826) (kafka.log.LocalLog$)
      [2024-06-17 15:52:39,641] ERROR Uncaught exception in scheduled task 'delete-file' (org.apache.kafka.server.util.KafkaScheduler)
      org.apache.kafka.common.errors.KafkaStorageException: The log dir /tmp/kafka-logs-mnt/kafka-no-docker is already offline due to a previous IO exception.
      [2024-06-17 15:52:39,641] ERROR Exception while deleting Log(dir=/tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete, topicId=wohaEWpfTR6HuqDFlcIJYw, topic=69747657-f49d-453f-9fa2-4d4369199699, partition=0, highWatermark=10, lastStableOffset=10, logStartOffset=10, logEndOffset=10) in dir /tmp/kafka-logs-mnt/kafka-no-docker. (kafka.log.LogManager)
      org.apache.kafka.common.errors.KafkaStorageException: Error while deleting dir for 69747657-f49d-453f-9fa2-4d4369199699-0 in dir /tmp/kafka-logs-mnt/kafka-no-docker
      Caused by: java.nio.file.DirectoryNotEmptyException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete
      	at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246)
      	at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
      	at java.base/java.nio.file.Files.delete(Files.java:1152)
      	at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:923)
      	at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:901)
      	at java.base/java.nio.file.Files.walkFileTree(Files.java:2828)
      	at java.base/java.nio.file.Files.walkFileTree(Files.java:2882)
      	at org.apache.kafka.common.utils.Utils.delete(Utils.java:901)
      	at kafka.log.LocalLog.$anonfun$deleteEmptyDir$2(LocalLog.scala:243)
      	at kafka.log.LocalLog.deleteEmptyDir(LocalLog.scala:709)
      	at kafka.log.UnifiedLog.$anonfun$delete$2(UnifiedLog.scala:1734)
      	at kafka.log.UnifiedLog.delete(UnifiedLog.scala:1911)
      	at kafka.log.LogManager.deleteLogs(LogManager.scala:1152)
      	at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:1166)
      	at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      	at java.base/java.lang.Thread.run(Thread.java:833)
      [2024-06-17 15:52:39,642] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set 

      We tried different versions of Kafka (2.8 and 3.7) and the behavior is the same.

      Is there a reason to only log a warning when a file in the partition cannot be deleted, but to fail hard when it is the directory itself that cannot be deleted? Would it be possible to also log just a warning when the directory cannot be deleted and continue processing?
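
      To make the request concrete, here is a minimal sketch of the "warn and continue" behavior we have in mind. It is not Kafka code: the class BestEffortDelete and the method deleteQuietly are hypothetical names, and it only assumes the standard java.nio.file API that already appears in the stack traces above. Failures are collected and returned instead of being thrown, so a caller could log them as warnings and keep the log directory online.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Kafka.
final class BestEffortDelete {

    /**
     * Recursively deletes {@code root}, never throwing on I/O errors.
     * Returns one message per path that could not be deleted
     * (an empty list means everything was removed).
     */
    static List<String> deleteQuietly(Path root) {
        List<String> failures = new ArrayList<>();
        if (!Files.exists(root)) {
            return failures; // nothing to do
        }
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    tryDelete(file, failures);
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // A file that vanished during the walk is already gone.
                    if (!(exc instanceof NoSuchFileException)) {
                        failures.add(file + ": " + exc);
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc) {
                    if (exc != null) {
                        failures.add(dir + ": " + exc);
                    }
                    // May fail with DirectoryNotEmptyException; record it, do not throw.
                    tryDelete(dir, failures);
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            failures.add(root + ": " + e);
        }
        return failures;
    }

    private static void tryDelete(Path path, List<String> failures) {
        try {
            Files.deleteIfExists(path);
        } catch (IOException e) {
            failures.add(path + ": " + e);
        }
    }
}
```

      A caller would log each returned entry at WARN level and leave the directory for a later cleanup pass, rather than treating the failure as a fatal storage error.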

      In our case, after a restart of Kafka everything gets deleted as expected (disk glitch issue).

      Remark: our server does not have local storage, so we use a network disk, and such glitches may happen often.
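
      Since the deletion succeeds once Kafka is restarted, a simple retry might already be enough to ride out such a glitch. The following is only a sketch under that assumption: RetryingDirDelete, deleteDirWithRetry, maxAttempts and backoffMs are hypothetical names and parameters, not existing Kafka settings. It tries to remove an (expected to be empty) directory a few times and reports failure with a return value instead of an exception.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Kafka.
final class RetryingDirDelete {

    /**
     * Tries to delete {@code dir} up to {@code maxAttempts} times, sleeping
     * {@code backoffMs} milliseconds between attempts.
     * Returns true if the directory is gone afterwards, false otherwise.
     */
    static boolean deleteDirWithRetry(Path dir, int maxAttempts, long backoffMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Files.deleteIfExists(dir); // throws if the directory is not empty
                return true;
            } catch (IOException e) {
                // Warn and retry instead of failing the whole log directory.
                System.err.println("WARN attempt " + attempt + "/" + maxAttempts
                        + " failed to delete " + dir + ": " + e);
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                }
            }
        }
        return false;
    }
}
```

      If all attempts fail, the caller still only gets false back and can decide to leave the directory in place until the next scheduled cleanup.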


          People

            Assignee: Unassigned
            Reporter: bulrog59 Jerome Morel
            Votes: 0
            Watchers: 5
