HDFS-979

FSImage should specify which dirs are missing when refusing to come up

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.24.0
    • Component/s: namenode
    • Labels: None
    • Release Note:
      Added more verbose error logging in FSImage for the case where the list of NameNode data or edits directories turns out to be empty.

      Description

      When FSImage can't come up because it has no data or edits dirs, it tells me this:

      java.io.IOException: All specified directories are not accessible or do not exist.
      

      What it doesn't do is say which of the two attributes is missing, which would help anyone trying to track down the problem. Also, I don't think the message is correct: it's bailing out because dataDirs.size() == 0 || editsDirs.size() == 0, i.e. because a list is empty, not because the dirs aren't there, as there hasn't been any validation at that point.

      More useful would be:

      1. Explicit mention of which attributes are null
      2. A declaration that this is because they are not set in the config
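
      As a rough illustration of what this asks for (the helper name and config keys below are assumptions for the sketch, not taken from the attached patches):

        import java.io.IOException;
        import java.net.URI;
        import java.util.Collection;

        class DirListCheck {
          // Hypothetical helper: fail with a message that names the empty
          // list and the config key that feeds it, instead of the generic
          // error above.
          static void requireNonEmpty(Collection<URI> dirs, String what,
              String confKey) throws IOException {
            if (dirs == null || dirs.isEmpty()) {
              throw new IOException("No NameNode " + what
                  + " directories are set. Check the '" + confKey
                  + "' value in your configuration.");
            }
          }
        }

      Calling requireNonEmpty(dataDirs, "data", "dfs.namenode.name.dir") and requireNonEmpty(editsDirs, "edits", "dfs.namenode.edits.dir") separately would then replace the single combined condition.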
      Attachments

      1. HDFS-979-take3.txt (5 kB, Jim Plush)
      2. HDFS-979-take2.txt (5 kB, Jim Plush)
      3. HDFS-979-take1.txt (5 kB, Jim Plush)

        Activity

        Harsh J added a comment -

        Hey Jim, did you get a chance to apply Matt's review comments to the patch?

        Matt Foley added a comment -

        Sorry for the delay, I was at the Hadoop Summit. Thanks for waiting.
        In general, more specific logging is good. I wanted to check that this enhancement does not overlap with the reporting of bad dataDirs and editsDirs from e.g. the callers of NNStorage#reportErrorsOnDirectories(). I now believe there is no collision, so go for it! Here are some additional comments on this patch:

        FSImage:

        recoverTransitionRead()

        • Please use dataDirs.isEmpty() rather than dataDirs.size() == 0; same for editsDirs. But it might be best to also check that they are non-null before checking for emptiness.
        • The enveloping check for "if (startOpt != StartupOption.IMPORT) {...}" concerns me, since the old code did not have this check. It's confusing because this method is called from both loadFSImage() and doImportCheckpoint(), and in either case the dataDirs and editsDirs should be non-null and non-empty. Here's the question: when IMPORT is set, and loadFSImage calls recoverTransitionRead for the FIRST time (before the recursive call from doImportCheckpoint), isn't it important that dataDirs and editsDirs be checked for non-emptiness? I'm not sure, because FSNamesystem#getStorageDirs does have a comment that says "// When importing image from a checkpoint, the name-node can start with empty set of storage directories." But it also logs a warning that this is a really bad idea, since no edits log could be recorded. Can you explain why the old code (without this check) was considered okay?

        • Looking further at FSNamesystem#getStorageDirs, when IMPORT is not set, it seems you'll never get an empty directory list. It does

            if (dirNames.isEmpty())
              dirNames.add("file:///tmp/hadoop/dfs/name");

        getEmptyRecoveryIOExceptionMsg()

        • Comments have mis-spellings: editsDirs (not editDirs) and necessarily.
        • Suggest that the dirType strings should be "fsimage" and "edits" rather than "data" and "edit", in call from recoverTransitionRead(). Better yet, since this really should be an enum, why not use the NameNodeDirType values IMAGE and EDITS?
        • There's no need to pass configKey and directoriesToCheck args. Since this is a single-use method, keep the signature clean by fetching them within the getEmptyRecoveryIOExceptionMsg() method, based on the dirType arg.
        • In-line the value of "confValue".
        • The test for dirsCollection.size() == 0 isn't necessary. If dirsCollection isn't empty, this method won't be called, right? But if for some reason it gets called when dirsCollection isn't empty, the log string would be suddenly terminated at "Specifically: ". It's better to just proceed with constructing the log string.
        • In fact, why bother sending dirsCollection as an arg at all? It will always be empty. The dirType has all the info needed.
        • If dir.exists() print "EXISTS BUT IS APPARENTLY INACCESSIBLE", rather than just "EXISTS".
        • Instead of limiting the method to URIs with scheme == "file", can you use the FileContext stuff to test .exists() on the directories regardless of scheme?
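
        For orientation, a rough sketch of how these comments might combine (a local DirType enum stands in for the NameNodeDirType values Matt mentions, to keep the sketch self-contained; the key names and message wording are illustrative, not the committed code):

          import java.io.File;
          import java.net.URI;

          import org.apache.hadoop.conf.Configuration;

          class EmptyDirsMessageSketch {

            // Local stand-in for NameNodeDirType's IMAGE and EDITS values.
            enum DirType { IMAGE, EDITS }

            // Hypothetical shape of getEmptyRecoveryIOExceptionMsg(): given
            // only the dir type, look up the config key and raw config value
            // itself, and report for each named path whether it exists.
            static String emptyDirsMessage(Configuration conf, DirType type) {
              String key = (type == DirType.IMAGE)
                  ? "dfs.namenode.name.dir" : "dfs.namenode.edits.dir";
              String confValue = conf.get(key);
              if (confValue == null) {
                return "FSImage found no NameNode " + type + " dirs: '" + key
                    + "' is not set in the configuration.";
              }
              StringBuilder sb = new StringBuilder(
                  "FSImage found no usable NameNode " + type
                  + " dirs. The config value of '" + key + "' is: "
                  + confValue + ".");
              for (String name : confValue.split(",")) {
                URI uri = URI.create(name.trim());
                // Matt suggests FileContext to handle any scheme; for brevity
                // this sketch still checks only local file:// paths.
                if ("file".equals(uri.getScheme())) {
                  File dir = new File(uri.getPath());
                  sb.append(" Directory: ").append(dir).append(dir.exists()
                      ? " EXISTS BUT IS APPARENTLY INACCESSIBLE."
                      : " DOES NOT EXIST.");
                }
              }
              return sb.toString();
            }
          }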
        Matt Foley added a comment -

        Please hold off on committing for a few hours while I review existing logging that may intersect with this. Thanks.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12484010/HDFS-979-take3.txt
        against trunk revision 1140030.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/850//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/850//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/850//console

        This message is automatically generated.

        Jim Plush added a comment -

        Added suggestions from Steve. On the if statements, I "believe" the only thing I was missing was adding a space after the "if". I also split it out as two checks, as it seemed to make for cleaner code.

        Jim Plush added a comment -

        Sounds good Steve, I'll refactor items 1, 2, and 3. I do think 3 would probably clean some of the logic up, so I'll take a crack at it.

        Also, for #1 (code formatting), are you referring to the Sun standards for if spacing, or is there another doc I should take a look at?

        Steve Loughran added a comment -

        Seems reasonable and is close to committing. Some observations:

        1. The code would need some minor reformatting to meet the normal Hadoop style rules (primarily spacing in if() clauses).
        2. The test shouldn't call Throwable.getMessage() and then compare it, as some exceptions return null there. Safer to call toString() and assert on that, and include text in the assertions for better failure diagnosis; see the sketch after this list.
        3. One bigger change: should the initial condition be broken up into two, one for dataDirs and one for editsDirs? In favour of this, it's easier to see what is at fault; against it, users may end up fixing one bug and then going straight into the next one. It's probably simpler to leave the test as is, with Jim's extra diagnostics.
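
        On point 2, a minimal sketch of the safer assertion style (the test class and the stubbed failure below are hypothetical, not taken from the attached patches):

          import static org.junit.Assert.assertTrue;
          import static org.junit.Assert.fail;

          import java.io.IOException;

          import org.junit.Test;

          public class TestVerboseEmptyDirsMessage {

            // Hypothetical stand-in for the code path under test.
            private void startWithEmptyEditsDirs() throws IOException {
              throw new IOException("No NameNode edits dirs configured");
            }

            @Test
            public void testMessageNamesTheMissingDirs() {
              try {
                startWithEmptyEditsDirs();
                fail("Expected an IOException for empty edits dirs");
              } catch (IOException e) {
                // toString() never returns null, unlike getMessage(), and
                // including the exception text in the assertion message makes
                // failures easier to diagnose.
                assertTrue("Unexpected exception text: " + e,
                    e.toString().contains("edits"));
              }
            }
          }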
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12483882/HDFS-979-take2.txt
        against trunk revision 1139715.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/845//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/845//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/845//console

        This message is automatically generated.

        Jim Plush added a comment -

        Updated to use a StringBuilder in the loop after complaints from the Findbugs report.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12483878/HDFS-979-take1.txt
        against trunk revision 1139715.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/844//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/844//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/844//console

        This message is automatically generated.

        Jim Plush added a comment -

        Adding a verbose (the more verbose the more helpful?) exception message should either the dataDirs or the editDirs show a size of 0 in the recoverTransitionRead method. It will also loop through the directories and, if they use the file scheme, check that those directories actually exist on disk.

        For example, if the edits dirs list was empty and the directory did not exist locally, the message would be:

        "All specified directories are not accessible or do not exist. Specifically: FSImage reports 0 NameNode edit dirs. The config value of 'dfs.namenode.edits.dir' is: file:///opt/hadoop-doesnotexist/dfs/name. Directory: /opt/hadoop-doesnotexist/dfs/name DOES NOT EXIST."

        Jim Plush added a comment -

        There seem to be several other places where a generic message like this is displayed...

        ------------------------------------------------------------------------
        ➺ ack --java "All specified directories are not accessible"
        hdfs/src/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java
        134: "All specified directories are not accessible or do not exist.");

        hdfs/src/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
        188: "All specified directories are not accessible or do not exist.");

        hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
        201: "All specified directories are not accessible or do not exist.");


          People

          • Assignee: Jim Plush
          • Reporter: Steve Loughran
          • Votes: 1
          • Watchers: 5
