Kafka / KAFKA-15391

Delete topic may lead to directory offline

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6.0, 3.5.2
    • Component/s: core
    • Labels: None

    Description

      This is an edge case in which the entire log directory is marked offline when a topic is deleted. The symptoms of this scenario are characterised by the following logs:

      [2023-08-14 09:22:12,600] ERROR Uncaught exception in scheduled task 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)
      org.apache.kafka.common.errors.KafkaStorageException: Error while flushing log for test-0 in dir /tmp/kafka-15093588566723278510 with offset 221 (exclusive) and recovery point 221
      Caused by: java.nio.file.NoSuchFileException: /tmp/kafka-15093588566723278510/test-0

      The above log is followed by logs such as:

      [2023-08-14 09:22:12,601] ERROR Uncaught exception in scheduled task 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)
      org.apache.kafka.common.errors.KafkaStorageException: The log dir /tmp/kafka-15093588566723278510 is already offline due to a previous IO exception.

      The following sequence of events demonstrates the scenario in which this bug manifests:
      1. On the broker, the partition lock is acquired and UnifiedLog.roll() is called, which schedules an async call to flushUptoOffsetExclusive(). The roll may be triggered by the segment rotation time or size.
      2. The admin client calls deleteTopic.
      3. On the broker, LogManager.asyncDelete() is called, which calls UnifiedLog.renameDir().
      4. The directory for the partition is successfully renamed with a "delete" suffix.
      5. The async task scheduled in step 1 (flushUptoOffsetExclusive) starts executing. It tries to call localLog.flush() without acquiring the partition lock.
      6. LocalLog calls Utils.flushDir(), which fails with an IOException (the NoSuchFileException in the log above, because the directory no longer exists under its old name).
      7. On the IOException, the log directory is added to logDirFailureChannel.
      8. Any new interaction with this logDir fails, and a log line such as "The log dir $logDir is already offline due to a previous IO exception" is printed.
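      The sequence above can be modelled outside Kafka with a small Java sketch. It is only an illustration under stated assumptions: the class FlushAfterRenameDemo, its methods, the offlineDirs set and the "test-0.demo-delete" directory name are hypothetical stand-ins and not Kafka code. It assumes a platform where a directory can be opened with FileChannel (Linux, for example). A latch forces the ordering from the steps (flush scheduled, directory renamed, flush runs) so the failure is deterministic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-alone model of the race; none of these names are Kafka classes.
final class FlushAfterRenameDemo {
    // Stand-in for the broker's record of log directories that went offline.
    private static final Set<Path> offlineDirs = ConcurrentHashMap.newKeySet();

    // Opens the directory and forces it to disk (a directory fsync).
    static void flushDir(Path dir) throws IOException {
        try (FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true);
        }
    }

    // Flushes one partition directory. Any IOException marks the whole
    // parent log dir offline, and later calls are rejected up front.
    static void flushPartition(Path logDir, Path partitionDir) {
        if (offlineDirs.contains(logDir)) {
            throw new IllegalStateException(
                "The log dir " + logDir + " is already offline due to a previous IO exception.");
        }
        try {
            flushDir(partitionDir);
        } catch (IOException e) {
            offlineDirs.add(logDir);
            throw new UncheckedIOException(e);
        }
    }

    // Returns {class name of the first failure's cause, message of the second failure}.
    static String[] reproduce() {
        ExecutorService scheduler = Executors.newSingleThreadExecutor();
        try {
            Path logDir = Files.createTempDirectory("kafka-demo");
            Path partitionDir = Files.createDirectory(logDir.resolve("test-0"));
            Path renamedDir = logDir.resolve("test-0.demo-delete");
            CountDownLatch renamed = new CountDownLatch(1);

            // Step 1: an async flush is scheduled and captures the current path.
            Future<Object> flush = scheduler.submit(() -> {
                renamed.await(); // forces the flush to run after the rename
                flushPartition(logDir, partitionDir); // steps 5-7, no lock held
                return null;
            });

            // Steps 2-4: topic deletion renames the partition directory.
            Files.move(partitionDir, renamedDir);
            renamed.countDown();

            String firstFailure = "none";
            try {
                flush.get();
            } catch (ExecutionException e) {
                // Unwrap UncheckedIOException to reach the underlying IOException.
                firstFailure = e.getCause().getCause().getClass().getName();
            }

            // Step 8: even a flush of a directory that exists is now rejected.
            String secondFailure = "none";
            try {
                flushPartition(logDir, renamedDir);
            } catch (IllegalStateException e) {
                secondFailure = e.getMessage();
            }

            Files.deleteIfExists(renamedDir);
            Files.deleteIfExists(logDir);
            return new String[] {firstFailure, secondFailure};
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String[] result = reproduce();
        System.out.println("First failure cause: " + result[0]);
        System.out.println("Second failure: " + result[1]);
    }
}
```

      In this model the single failed flush of a stale path is enough to block every later operation on the same log dir, which mirrors the two log lines quoted above.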
       

      This is also the reason DeleteTopicTest is flaky: https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe/Berlin&tests.container=kafka.admin.DeleteTopicTest&tests.test=testDeleteTopicWithCleaner()

            People

              ocadaruma Haruki Okada
              divijvaidya Divij Vaidya
              Divij Vaidya Divij Vaidya