Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.2
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Target Version/s:

      Description

      Let's rename the various "processIOError" methods to be more descriptive. The current code makes it difficult to identify and reason about bug fixes. While we're at it let's remove "Fatal" from the "Unable to sync the edit log" log since it's not actually a fatal error (this is confusing to users). And 2NN "Checkpoint done" should be info, not a warning (also confusing to users).

      Thanks to HDFS-1073 these issues don't exist on trunk or 23.

      1. hdfs-2701.txt
        12 kB
        Eli Collins
      2. hdfs-2701.txt
        18 kB
        Eli Collins
      3. hdfs-2701.txt
        18 kB
        Eli Collins
      4. hdfs-2701.txt
        19 kB
        Eli Collins

        Issue Links

          Activity

          Hide
          Eli Collins added a comment -

          Tests and test-patch are clean. I've committed this.

          Show
          Eli Collins added a comment - Tests and test-patch are clean. I've committed this.
          Hide
          Todd Lipcon added a comment -

          Ah, right. I was thinking about trunk wrt restoreFailedStorage. +1 then

          Show
          Todd Lipcon added a comment - Ah, right. I was thinking about trunk wrt restoreFailedStorage. +1 then
          Hide
          Eli Collins added a comment -

          #1 This is handled in 2702 (see latest patch)
          #2 restoreFailedStorage isn't in branch-1 so not a issue here

          Show
          Eli Collins added a comment - #1 This is handled in 2702 (see latest patch) #2 restoreFailedStorage isn't in branch-1 so not a issue here
          Hide
          Todd Lipcon added a comment -

          in open(), if all of them fail to open, we'll have no edits streams... is that taken care of by 2702?


          in removeEditsForStorageDir, I think there might be a bug with the following sequence:

          • dir holding both edits and image fails
          • restoreFailedStorage is called so it is added back to the list for image operations, but edit logs haven't rolled yet, so it's not in editStreams
          • it fails again, so removeEditsForStorageDir is called with a dir that doesn't have any open stream.

          In that case, exitIfInvalidStreams() would exit even though nothing is getting removed.

          I guess this is taken care of by HDFS-2702?


          If the answer to both of the above is yes, then +1

          Show
          Todd Lipcon added a comment - in open(), if all of them fail to open, we'll have no edits streams... is that taken care of by 2702? in removeEditsForStorageDir, I think there might be a bug with the following sequence: dir holding both edits and image fails restoreFailedStorage is called so it is added back to the list for image operations, but edit logs haven't rolled yet, so it's not in editStreams it fails again, so removeEditsForStorageDir is called with a dir that doesn't have any open stream. In that case, exitIfInvalidStreams() would exit even though nothing is getting removed. I guess this is taken care of by HDFS-2702 ? If the answer to both of the above is yes, then +1
          Hide
          Eli Collins added a comment -

          Thanks for the review Todd. Updated patch attach.

          #1 Agree, I've done this in HDFS-2702, I was trying to keep this change to just cleanup/refactoring (the current crazy behavior is actually what causes HDFS-2702!).
          #2 Good catch. Fixed.
          #3-5 Done.

          Wrt testing see my comment in HDFS-2702. The short answer is that aside from the existing tests which are clean I've done manual testing (failing storage dirs and checkpointing) for 2701-2703 and am working on a unit test that will cover storage dir failures and removal.

          Show
          Eli Collins added a comment - Thanks for the review Todd. Updated patch attach. #1 Agree, I've done this in HDFS-2702 , I was trying to keep this change to just cleanup/refactoring (the current crazy behavior is actually what causes HDFS-2702 !). #2 Good catch. Fixed. #3-5 Done. Wrt testing see my comment in HDFS-2702 . The short answer is that aside from the existing tests which are clean I've done manual testing (failing storage dirs and checkpointing) for 2701-2703 and am working on a unit test that will cover storage dir failures and removal.
          Hide
          Todd Lipcon added a comment -
          • The behavior of exitIfInvalidStreams is extremely counter-intuitive... why don't you change it to check for an empty list, and just change the call site to call after the errored dir is removed?

          In removeEditsAndStorageDir:

          +    editStreams.remove(idx);
          +    fsimage.removeStorageDir(getStorageDirForStream(idx));
          

          I don't think this is correct – because getStorageDirForStream is called after it's removed from editStreams, it will remove the one that came after the stream in the storage dir list (or throw an ArrayIndexOutOfBounds if it was the last stream)


          In removeEditsStreamsAndStorageDirs, you can use a foreach loop instead of indexed iteration:

          +    for (int i = 0; i < errorStreams.size(); i++) {
          +      int idx = editStreams.indexOf(errorStreams.get(i));
          

          +        FSNamesystem.LOG.error("Unable to sync edit log");
          

          we should probably include the path of the failed stream here


          +          throw new IOException(
          +              "Inconsistent existance of edits.new " + editsNew);
          

          spelling error - should be "existence"


          What's the test plan for this, HDFS-2702, and HDFS-2703? I agree its buggy but we should articulate a way to make sure we fixed all the issues and didn't introduce new ones.

          Show
          Todd Lipcon added a comment - The behavior of exitIfInvalidStreams is extremely counter-intuitive... why don't you change it to check for an empty list, and just change the call site to call after the errored dir is removed? In removeEditsAndStorageDir: + editStreams.remove(idx); + fsimage.removeStorageDir(getStorageDirForStream(idx)); I don't think this is correct – because getStorageDirForStream is called after it's removed from editStreams, it will remove the one that came after the stream in the storage dir list (or throw an ArrayIndexOutOfBounds if it was the last stream) In removeEditsStreamsAndStorageDirs , you can use a foreach loop instead of indexed iteration: + for ( int i = 0; i < errorStreams.size(); i++) { + int idx = editStreams.indexOf(errorStreams.get(i)); + FSNamesystem.LOG.error( "Unable to sync edit log" ); we should probably include the path of the failed stream here + throw new IOException( + "Inconsistent existance of edits. new " + editsNew); spelling error - should be "existence" What's the test plan for this, HDFS-2702 , and HDFS-2703 ? I agree its buggy but we should articulate a way to make sure we fixed all the issues and didn't introduce new ones.
          Hide
          Eli Collins added a comment -

          Minor update.

          Show
          Eli Collins added a comment - Minor update.
          Hide
          Eli Collins added a comment -

          test-patch results. Just cleanup so existing tests suffice. Will do a full test run before committing.

               [exec] 
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
               [exec]                         Please justify why no tests are needed for this patch.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
               [exec] 
          
          Show
          Eli Collins added a comment - test-patch results. Just cleanup so existing tests suffice. Will do a full test run before committing. [exec] [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no tests are needed for this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec]
          Hide
          Eli Collins added a comment -

          Patch attached. Running test-patch now.

          Show
          Eli Collins added a comment - Patch attached. Running test-patch now.

            People

            • Assignee:
              Eli Collins
              Reporter:
              Eli Collins
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development