Details
- Sub-task
- Status: Resolved
- Critical
- Resolution: Fixed
- None
Description
When the DN is shut down, the ContainerDataScanner.shutdown() method is called:
public synchronized void shutdown() {
  this.stopping = true;
  this.canceler.cancel(
      String.format(NAME_FORMAT, volume) + " is shutting down");
  this.interrupt();
  try {
    this.join();
  } catch (InterruptedException ex) {
    LOG.warn("Unexpected exception while stopping data scanner for volume "
        + volume, ex);
    Thread.currentThread().interrupt();
  }
}
This interrupts the scanner thread, which may be in the middle of scanning a container. The code to scan a container looks like:
public boolean fullCheck(DataTransferThrottler throttler, Canceler canceler) {
  boolean valid;
  try {
    valid = fastCheck();
    if (valid) {
      scanData(throttler, canceler);
    }
  } catch (IOException e) {
    handleCorruption(e);
    valid = false;
  }
  return valid;
}
The interrupt causes some method further down the stack to throw an exception, which is then caught by the IOException handler. Right now, the handler assumes any exception is due to the container being unhealthy, and marks the container as such.
Adding some debug code, we can see the real exception when this occurs is "java.nio.channels.ClosedByInterruptException":
datanode_1 | 2022-08-05 12:08:51,676 [ContainerDataScanner(/data/hdds/hdds)] INFO keyvalue.KeyValueContainerCheck: IO exception in checker
datanode_1 | java.nio.channels.ClosedByInterruptException
datanode_1 | at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
datanode_1 | at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
datanode_1 | at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:366)
datanode_1 | at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.verifyChecksum(KeyValueContainerCheck.java:295)
datanode_1 | at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.scanData(KeyValueContainerCheck.java:272)
datanode_1 | at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.fullCheck(KeyValueContainerCheck.java:128)
datanode_1 | at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.scanData(KeyValueContainer.java:849)
datanode_1 | at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.runIteration(ContainerDataScanner.java:106)
datanode_1 | at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.run(ContainerDataScanner.java:81)
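The trace is consistent with how interruptible channels behave in the JDK: a FileChannel operation performed by a thread whose interrupt flag is set closes the channel and throws ClosedByInterruptException, which is a subclass of IOException. Below is a minimal standalone sketch of that behaviour. It does not use any Ozone code; the class name InterruptedReadDemo, the method name, and the temp file are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of Ozone.
class InterruptedReadDemo {

  /**
   * Reads from a healthy temp file while the thread's interrupt flag is set
   * and returns the class name of the IOException that a generic
   * catch (IOException) block would see.
   */
  static String exceptionSeenWhenInterrupted() {
    Path file = null;
    try {
      file = Files.createTempFile("scan-demo", ".bin");
      Files.write(file, new byte[] {1, 2, 3, 4});
      try (FileChannel channel =
               FileChannel.open(file, StandardOpenOption.READ)) {
        // Stand-in for shutdown() calling interrupt() on the scanner thread.
        Thread.currentThread().interrupt();
        channel.read(ByteBuffer.allocate(4));
        return "none";
      } catch (IOException e) {
        // The file itself is fine; the failure comes from the interrupt.
        return e.getClass().getName();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      Thread.interrupted(); // clear the flag so the caller is unaffected
      if (file != null) {
        file.toFile().delete();
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(exceptionSeenWhenInterrupted());
  }
}
```

The data on disk is intact in this sketch, yet a catch-all IOException handler would still treat the read as a failed check, which matches what the scanner does on shutdown.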
I am not sure whether other types of exceptions could be raised, so simply catching ClosedByInterruptException is probably not a good solution. I feel we should raise specific container integrity exceptions when the container should be marked unhealthy, and the catch-all IOException handler probably should not be used for that purpose.
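As a rough sketch of that idea, and not the actual fix, the check could report corruption only when a dedicated integrity exception is thrown and treat every other IOException as inconclusive. All names here (ScanSketch, ContainerIntegrityException, Result, ScanStep) are hypothetical and do not exist in Ozone; the real fastCheck() and scanData() calls are replaced by a single callback so the example is self-contained.

```java
import java.io.IOException;

// Hypothetical sketch, not Ozone code.
class ScanSketch {

  /** Hypothetical: thrown only on positive evidence of corruption. */
  static class ContainerIntegrityException extends IOException {
    ContainerIntegrityException(String message) {
      super(message);
    }
  }

  enum Result { HEALTHY, CORRUPT, INCONCLUSIVE }

  /** Stand-in for the fastCheck() + scanData() sequence. */
  interface ScanStep {
    void run() throws IOException;
  }

  static Result fullCheck(ScanStep scan) {
    try {
      scan.run();
      return Result.HEALTHY;
    } catch (ContainerIntegrityException e) {
      // Only this path would mark the container unhealthy.
      return Result.CORRUPT;
    } catch (IOException e) {
      // Covers ClosedByInterruptException and any other I/O failure
      // that says nothing about the container's contents.
      return Result.INCONCLUSIVE;
    }
  }

  public static void main(String[] args) {
    System.out.println(fullCheck(() -> { }));
    System.out.println(fullCheck(() -> {
      throw new ContainerIntegrityException("checksum mismatch");
    }));
    System.out.println(fullCheck(() -> {
      throw new java.nio.channels.ClosedByInterruptException();
    }));
  }
}
```

With this shape, a scan interrupted during shutdown ends up as INCONCLUSIVE rather than CORRUPT, without needing to enumerate every exception an interrupt might produce.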