  1. Apache Ozone
  2. HDDS-7364 Improved container scanning
  3. HDDS-7100

Container scanner incorrectly marks containers unhealthy when DN is shutdown


Details

    Description

      When the DN is shut down, the ContainerDataScanner.shutdown() method is called:

        public synchronized void shutdown() {
          this.stopping = true;
          this.canceler.cancel(
              String.format(NAME_FORMAT, volume) + " is shutting down");
          this.interrupt();
          try {
            this.join();
          } catch (InterruptedException ex) {
            LOG.warn("Unexpected exception while stopping data scanner for volume "
                + volume, ex);
            Thread.currentThread().interrupt();
          }
        }
      

      This interrupts the scanner thread. The code to scan a container looks like:

        public boolean fullCheck(DataTransferThrottler throttler, Canceler canceler) {
          boolean valid;
      
          try {
            valid = fastCheck();
            if (valid) {
              scanData(throttler, canceler);
            }
          } catch (IOException e) {
            handleCorruption(e);
            valid = false;
          }
      
          return valid;
        }
      

      The interrupt causes a method further down the stack to throw an exception, which is then caught by the IOException handler. Right now, the handler assumes any exception means the container is unhealthy, and marks the container as such.

      Adding some debug code, we can see the real exception when this occurs is "java.nio.channels.ClosedByInterruptException":

      datanode_1  | 2022-08-05 12:08:51,676 [ContainerDataScanner(/data/hdds/hdds)] INFO keyvalue.KeyValueContainerCheck: IO exception in checker
      datanode_1  | java.nio.channels.ClosedByInterruptException
      datanode_1  | 	at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
      datanode_1  | 	at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
      datanode_1  | 	at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:366)
      datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.verifyChecksum(KeyValueContainerCheck.java:295)
      datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.scanData(KeyValueContainerCheck.java:272)
      datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.fullCheck(KeyValueContainerCheck.java:128)
      datanode_1  | 	at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.scanData(KeyValueContainer.java:849)
      datanode_1  | 	at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.runIteration(ContainerDataScanner.java:106)
      datanode_1  | 	at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.run(ContainerDataScanner.java:81)
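
      The trace above matches the documented behaviour of interruptible NIO channels: a blocking operation on a FileChannel, performed by a thread whose interrupt status is set, closes the channel and throws ClosedByInterruptException, which is a subclass of IOException. A minimal sketch that reproduces this outside Ozone follows. The class and method names are hypothetical, and the interrupt is set on the calling thread only as a stand-in for shutdown() interrupting the scanner thread:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; not part of Ozone.
final class InterruptedReadDemo {

  // Performs a FileChannel read on a thread whose interrupt flag is already
  // set and returns the IOException that results, or null if none is thrown.
  static IOException readWhileInterrupted() throws IOException {
    Path file = Files.createTempFile("scan-demo", ".bin");
    try {
      Files.write(file, new byte[1024]);
      try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        // Stand-in for ContainerDataScanner.shutdown() calling this.interrupt().
        Thread.currentThread().interrupt();
        channel.read(ByteBuffer.allocate(16));
        return null;
      } catch (IOException e) {
        // A catch-all IOException handler sees the interrupt-induced failure
        // exactly as it would see a genuine read error.
        return e;
      }
    } finally {
      Thread.interrupted(); // clear the flag so the caller is unaffected
      Files.deleteIfExists(file);
    }
  }

  public static void main(String[] args) throws IOException {
    IOException e = readWhileInterrupted();
    System.out.println(e == null ? "no exception" : e.getClass().getName());
  }
}
```

      The file being read here is intact, so the exception says nothing about data integrity. It only reports that the reading thread was interrupted.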
      

      I am not sure whether other types of exception could be raised, so simply catching ClosedByInterruptException is probably not a good solution. I feel we should raise specific container integrity exceptions when the container should be marked unhealthy, and the catch-all IOException handler probably should not be used for that.
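
      One possible shape for that suggestion is sketched below. It is an illustration only, not the actual fix: ContainerIntegrityException, Outcome, Scan and this fullCheck signature are all hypothetical names. The point is that only a dedicated integrity exception leads to an unhealthy verdict, while any other IOException (including ClosedByInterruptException during shutdown) leaves the container state untouched:

```java
import java.io.IOException;

// Hypothetical sketch; none of these types exist in Ozone under these names.
final class ScanResultSketch {

  // Raised only when the scan has positive evidence of corruption,
  // for example a checksum mismatch.
  static final class ContainerIntegrityException extends IOException {
    ContainerIntegrityException(String message) {
      super(message);
    }
  }

  enum Outcome { HEALTHY, UNHEALTHY, INCONCLUSIVE }

  // Stand-in for the work done by fastCheck() and scanData().
  interface Scan {
    void run() throws IOException;
  }

  static Outcome fullCheck(Scan scan) {
    try {
      scan.run();
      return Outcome.HEALTHY;
    } catch (ContainerIntegrityException e) {
      // Only this path would call handleCorruption(e) in the real code.
      return Outcome.UNHEALTHY;
    } catch (IOException e) {
      // Interrupts and other unclassified I/O failures: do not mark the
      // container unhealthy, let a later scan decide.
      return Outcome.INCONCLUSIVE;
    }
  }
}
```

      With this split, a scan aborted by shutdown returns INCONCLUSIVE instead of condemning the container, and no list of interrupt-related exception types has to be maintained.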

            People

              erose Ethan Rose
              nilotpalnandi Nilotpal Nandi