Description
This is an edge case where the entire log directory is marked offline when we delete a topic. This symptoms of this scenario is characterised by the following logs:
[2023-08-14 09:22:12,600] ERROR Uncaught exception in scheduled task 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152) org.apache.kafka.common.errors.KafkaStorageException: Error while flushing log for test-0 in dir /tmp/kafka-15093588566723278510 with offset 221 (exclusive) and recovery point 221 Caused by: java.nio.file.NoSuchFileException: /tmp/kafka-15093588566723278510/test-0
The above log is followed by logs such as:
[2023-08-14 09:22:12,601] ERROR Uncaught exception in scheduled task 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)org.apache.kafka.common.errors.KafkaStorageException: The log dir /tmp/kafka-15093588566723278510 is already offline due to a previous IO exception.
The below sequence of events demonstrate the scenario where this bug manifests
1. On the broker, partition lock is acquired and UnifiedLog.roll() is called which schedules an async call for
flushUptoOffsetExclusive(). The roll may be called due to segment rotation time or size.
2. Admin client calls deleteTopic
3. On the broker, LogManager.asyncDelete() is called which will call UnifiedLog.renameDir()
4. The directory for the partition is successfully renamed with a "delete" suffix.
5. The async task scheduled in step 1 (flushUptoOffsetExclusive) starts executing. It tries to call localLog.flush() without acquiring a partition lock.
6. LocalLog calls Utils.flushDir() which fails with an IOException.
7. On IOException, log directory is added to logDirFailureChannel
8. Any new interaction with this logDir fails and a log line is printed such as
"The log dir $logDir is already offline due to a previous IO exception"
This is the reason DeleteTopicTest is flaky as well - https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe/Berlin&tests.container=kafka.admin.DeleteTopicTest&tests.test=testDeleteTopicWithCleaner()
Attachments
Issue Links
- fixes
-
KAFKA-15572 Race condition between log roll and log rename
- Resolved
- is related to
-
KAFKA-13855 FileNotFoundException: Error while rolling log segment for topic partition in dir
- Open
-
KAFKA-13403 KafkaServer crashes when deleting topics due to the race in log deletion
- Resolved
- links to