[HDDS-7100] Container scanner incorrectly marks containers unhealthy when DN is shutdown - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: Ozone Datanode
Labels:
- pull-request-available

Description

When the DN is shut down, the ContainerDataScanner.shutdown() method is called:

  public synchronized void shutdown() {
    this.stopping = true;
    this.canceler.cancel(
        String.format(NAME_FORMAT, volume) + " is shutting down");
    this.interrupt();
    try {
      this.join();
    } catch (InterruptedException ex) {
      LOG.warn("Unexpected exception while stopping data scanner for volume "
          + volume, ex);
      Thread.currentThread().interrupt();
    }
  }

This interrupts the scanner thread. The code to scan a container looks like:

  public boolean fullCheck(DataTransferThrottler throttler, Canceler canceler) {
    boolean valid;

    try {
      valid = fastCheck();
      if (valid) {
        scanData(throttler, canceler);
      }
    } catch (IOException e) {
      handleCorruption(e);
      valid = false;
    }

    return valid;
  }

The interrupt causes some method further down the stack to throw an exception, which is then caught by the IOException handler. Right now, the handler assumes any exception means the container is unhealthy, and marks the container as such.

After adding some debug code, we can see that the real exception when this occurs is "java.nio.channels.ClosedByInterruptException":

datanode_1  | 2022-08-05 12:08:51,676 [ContainerDataScanner(/data/hdds/hdds)] INFO keyvalue.KeyValueContainerCheck: IO exception in checker
datanode_1  | java.nio.channels.ClosedByInterruptException
datanode_1  | 	at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
datanode_1  | 	at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
datanode_1  | 	at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:366)
datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.verifyChecksum(KeyValueContainerCheck.java:295)
datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.scanData(KeyValueContainerCheck.java:272)
datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.fullCheck(KeyValueContainerCheck.java:128)
datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.scanData(KeyValueContainer.java:849)
datanode_1  | 	at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.runIteration(ContainerDataScanner.java:106)
datanode_1  | 	at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.run(ContainerDataScanner.java:81)
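
The failure mode can be reproduced outside Ozone with a minimal sketch. It relies on the documented behavior of interruptible channels: a blocking operation on a FileChannel from a thread whose interrupt status is set closes the channel and throws ClosedByInterruptException, which is a subclass of IOException. The class InterruptedScanDemo, the Outcome enum, and the methods naiveScan and scanWithInterrupt are hypothetical names used only for illustration; they are not Ozone code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptedScanDemo {

  /** Hypothetical stand-in for a scan result; not an Ozone type. */
  public enum Outcome { HEALTHY, MARKED_UNHEALTHY }

  /** Mirrors the catch-all pattern in fullCheck(): any IOException counts as corruption. */
  public static Outcome naiveScan(Path file) {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      channel.read(ByteBuffer.allocate(16));
      return Outcome.HEALTHY;
    } catch (IOException e) {
      // ClosedByInterruptException extends IOException, so it lands here too.
      System.out.println("Caught: " + e.getClass().getName());
      return Outcome.MARKED_UNHEALTHY;
    }
  }

  /** Scans a small temp file on the current thread, optionally with its interrupt flag set. */
  public static Outcome scanWithInterrupt(boolean interrupted) {
    try {
      Path file = Files.createTempFile("container", ".data");
      try {
        Files.write(file, new byte[] {1, 2, 3, 4});
        if (interrupted) {
          // Simulates shutdown() interrupting the scanner thread.
          Thread.currentThread().interrupt();
        }
        return naiveScan(file);
      } finally {
        // Clear the interrupt flag so the caller is unaffected, then clean up.
        Thread.interrupted();
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(scanWithInterrupt(false));
    System.out.println(scanWithInterrupt(true));
  }
}
```

The file contents are identical in both runs; only the interrupt flag differs. The interrupted run is nevertheless reported as unhealthy, which is the misclassification described in this issue.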

I am not sure whether other types of exceptions could be raised, so simply catching ClosedByInterruptException is probably not a good solution. I feel we should raise specific container integrity exceptions when the container should be marked unhealthy, and the catch-all IOException handler probably should not be used for that purpose.
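
One way to picture this suggestion is the sketch below. It is an assumption about what such a design could look like, not the change that was eventually merged: ContainerIntegrityException, ScanResult, ScanStep, and this version of fullCheck are all hypothetical names. Only a dedicated integrity exception marks the container unhealthy; any other IOException, including ClosedByInterruptException, leaves the container's state alone.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;

public class TypedScanFailureDemo {

  /** Hypothetical exception: raised only when scanned data fails verification. */
  public static class ContainerIntegrityException extends IOException {
    public ContainerIntegrityException(String message) {
      super(message);
    }
  }

  /** Hypothetical result type with an explicit "could not decide" value. */
  public enum ScanResult { HEALTHY, UNHEALTHY, INCONCLUSIVE }

  /** Hypothetical scan step, standing in for scanData(). */
  public interface ScanStep {
    void run() throws IOException;
  }

  public static ScanResult fullCheck(ScanStep scanData) {
    try {
      scanData.run();
      return ScanResult.HEALTHY;
    } catch (ContainerIntegrityException e) {
      // Positive evidence of corruption: the only path that marks unhealthy.
      return ScanResult.UNHEALTHY;
    } catch (IOException e) {
      // Interrupts and other I/O failures say nothing about the data itself.
      return ScanResult.INCONCLUSIVE;
    }
  }

  public static void main(String[] args) {
    System.out.println(fullCheck(() -> { }));
    System.out.println(fullCheck(() -> {
      throw new ContainerIntegrityException("checksum mismatch");
    }));
    System.out.println(fullCheck(() -> {
      throw new ClosedByInterruptException();
    }));
  }
}
```

The order of the catch clauses matters: the more specific ContainerIntegrityException must come before IOException, otherwise the code does not compile.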

Issue Links

links to

GitHub Pull Request #4951

People

Assignee: Ethan Rose

Reporter: Nilotpal Nandi

Votes: 0

Watchers: 2

Dates

Created: 05/Aug/22 20:24

Updated: 27/Jun/23 02:29

Resolved: 27/Jun/23 02:29