Description
There is a small bug in how KafkaServer handles an I/O error when writing the broker metadata checkpoint. The path passed to the log dir failure channel is the full path of the checkpoint file, whereas only the log directory is expected (source).
case e: IOException =>
  val dirPath = checkpoint.file.getAbsolutePath
  logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while writing meta.properties to $dirPath", e)
As a result, after an IOException is captured and enqueued in the log dir failure channel (<logDir> is to be replaced with the actual path of the log directory):
[2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to <logDir>/meta.properties (kafka.server.LogDirFailureChannel)
java.io.IOException
The log dir failure handler cannot look up the log directory:
[2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler)
org.apache.kafka.common.errors.LogDirNotFoundException: Log dir <logDir>/meta.properties is not found in the config.
An immediate fix for this is to use the logDir passed to the checkpointing method instead of the path of the metadata file.
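The proposed fix can be sketched as follows. This is a minimal, self-contained illustration rather than the actual KafkaServer code: StubLogDirFailureChannel, StubCheckpoint and CheckpointFix are hypothetical stand-ins, and only the maybeAddOfflineLogDir call mirrors the snippet quoted above.

```scala
import java.io.{File, IOException}
import scala.collection.mutable

// Hypothetical stand-in for the log dir failure channel; it only records
// which directory each failure was reported against.
class StubLogDirFailureChannel {
  val offlineDirs = mutable.ListBuffer.empty[String]
  def maybeAddOfflineLogDir(logDir: String, msg: => String, e: IOException): Unit =
    offlineDirs += logDir
}

// Hypothetical stand-in for the metadata checkpoint; write() always fails
// so that the error path is exercised.
class StubCheckpoint(val file: File) {
  def write(): Unit = throw new IOException("simulated disk failure")
}

object CheckpointFix {
  def checkpointBrokerMetadata(logDir: String, channel: StubLogDirFailureChannel): Unit = {
    val checkpoint = new StubCheckpoint(new File(logDir, "meta.properties"))
    try checkpoint.write()
    catch {
      case e: IOException =>
        // Report the log directory itself, not checkpoint.file.getAbsolutePath,
        // so that the failure handler can find it in the configured log dirs.
        channel.maybeAddOfflineLogDir(logDir, s"Error while writing meta.properties to $logDir", e)
    }
  }
}
```

With this change the value enqueued in the channel is the log directory itself, which is what the failure handler looks up in the config.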
For brokers with only one log directory, this bug prevents the broker from shutting down as expected.
The LogDirNotFoundException kills the log dir failure handler thread, so subsequent IOExceptions are not handled and the broker never stops.
[2024-02-27 02:13:13,564] INFO [LogDirFailureHandler]: Stopped (kafka.server.ReplicaManager$LogDirFailureHandler)
Another consideration here is whether the LogDirNotFoundException should terminate the log dir failure handler thread.
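One possible shape for a more tolerant handler is sketched below, purely as an assumption about how this could be approached and not as the existing ReplicaManager logic. StubLogDirNotFoundException and TolerantFailureHandler are hypothetical names; the idea is that an unknown directory is recorded and skipped so that later failures are still processed.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

// Hypothetical stand-in for LogDirNotFoundException.
class StubLogDirNotFoundException(msg: String) extends RuntimeException(msg)

// Hypothetical handler that survives a lookup failure instead of terminating.
class TolerantFailureHandler(configuredDirs: Set[String]) {
  val handled = mutable.ListBuffer.empty[String]
  val skipped = mutable.ListBuffer.empty[String]

  private def handle(dir: String): Unit = {
    if (!configuredDirs.contains(dir))
      throw new StubLogDirNotFoundException(s"Log dir $dir is not found in the config.")
    handled += dir
  }

  // Processes every queued directory; an unknown directory is recorded
  // and skipped rather than ending the loop.
  def drain(queue: LinkedBlockingQueue[String]): Unit = {
    var dir = queue.poll()
    while (dir != null) {
      try handle(dir)
      catch {
        case _: StubLogDirNotFoundException => skipped += dir
      }
      dir = queue.poll()
    }
  }
}
```

Whether skipping is the right policy is an open question; the sketch only shows that one bad entry does not have to stop the handling of later failures.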