  1. Apache Ozone
  2. HDDS-4269

Ozone DataNode thinks a volume is failed if an unexpected file is in the HDDS root directory


Details

    Description

      It took me some time to debug a trivial bug.

      The DataNode shuts down after the following log messages, with no explanation of the cause:

      10:11:44.382 PM	INFO	MutableVolumeSet	Moving Volume : /var/lib/hadoop-ozone/fake_datanode/data/hdds to failed Volumes
      10:11:46.287 PM	ERROR	StateContext	Critical error occurred in StateMachine, setting shutDownMachine
      10:11:46.287 PM	ERROR	DatanodeStateMachine	DatanodeStateMachine Shutdown due to an critical error
      

      It turns out that if there are any unexpected files under the hdds directory ($hdds.datanode.dir/hdds), the DN considers the volume bad and moves it to the failed volume list without logging the reason. I was editing the VERSION file, and vim created a temp file in that directory. This is impossible to debug without reading the code.

      HddsVolumeUtil#checkVolume()
      } else if (hddsFiles.length == 2) {
        // The files should be Version and SCM directory
        if (scmDir.exists()) {
          return true;
        } else {
          logger.error("Volume {} is in Inconsistent state, expected scm " +
              "directory {} does not exist", volumeRoot,
              scmDir.getAbsolutePath());
          return false;
        }
      } else {
        // The hdds root dir should always have 2 files. One is Version file
        // and other is SCM directory.
        // <---- HERE! Returns false without logging any error.
        return false;
      }
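
      One way the final branch could explain itself is to name the entries it did not expect. The following is only a minimal sketch under stated assumptions, not the actual HddsVolumeUtil code or the fix that was committed: the class VolumeCheckSketch, its methods, the SCM directory name "scm-id", and the swap file name ".VERSION.swp" are all hypothetical and chosen for illustration. It assumes the version file is named VERSION, as mentioned above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not part of Apache Ozone. */
final class VolumeCheckSketch {

  /** Assumed name of the version file in the hdds root directory. */
  static final String VERSION_FILE = "VERSION";

  private VolumeCheckSketch() {
  }

  /**
   * Returns the sorted names of all entries in hddsRoot other than the
   * version file and the expected SCM directory. Returns an empty list
   * if hddsRoot cannot be listed.
   */
  static List<String> unexpectedEntries(File hddsRoot, String scmDirName) {
    List<String> unexpected = new ArrayList<>();
    String[] names = hddsRoot.list();
    if (names == null) {
      return unexpected;
    }
    Arrays.sort(names);
    for (String name : names) {
      if (!name.equals(VERSION_FILE) && !name.equals(scmDirName)) {
        unexpected.add(name);
      }
    }
    return unexpected;
  }

  /** Builds a message that the final else branch could log. */
  static String describeFailure(File hddsRoot, String scmDirName) {
    return String.format(
        "Volume %s is in an inconsistent state: expected only %s and "
            + "directory %s, but also found %s",
        hddsRoot.getAbsolutePath(), VERSION_FILE, scmDirName,
        unexpectedEntries(hddsRoot, scmDirName));
  }

  public static void main(String[] args) throws IOException {
    // Recreate the situation from this report in a temporary directory.
    Path root = Files.createTempDirectory("hdds");
    Files.createFile(root.resolve(VERSION_FILE));
    Files.createDirectory(root.resolve("scm-id"));   // hypothetical SCM dir
    Files.createFile(root.resolve(".VERSION.swp"));  // editor leftover
    System.err.println(describeFailure(root.toFile(), "scm-id"));
  }
}
```

      With a message like this in the log, a stray editor file would be identifiable from the DataNode log alone, without reading the source.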
      

            People

              flirmnave Huang-Mu Zheng
              weichiu Wei-Chiu Chuang