Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-2702

A single failed name dir can cause the NN to exit

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.2
    • Component/s: namenode
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      There's a bug in FSEditLog#rollEditLog which results in the NN process exiting if a single name dir has failed. Here's the relevant code:

      close()  // So editStreams.size() is 0 
      foreach edits dir {
        ..
        eStream = new ...  // Might get an IOE here
        editStreams.add(eStream);
      } catch (IOException ioe) {
        removeEditsForStorageDir(sd);  // exits if editStreams.size() <= 1  
      }
      

      If we get an IOException before we've added two edits streams to the list we'll exit, eg if there's an error processing the 1st name dir we'll exit even if there are 4 valid name dirs. The fix is to move the checking out of removeEditsForStorageDir (nee processIOError) or modify it so it can be disabled in some cases, eg here where we don't yet know how many streams are valid.

      1. hdfs-2702.txt
        4 kB
        Eli Collins
      2. hdfs-2702.txt
        11 kB
        Eli Collins
      3. hdfs-2702.txt
        12 kB
        Eli Collins
      4. hdfs-2702.txt
        12 kB
        Eli Collins
      5. hdfs-2702.txt
        13 kB
        Eli Collins

        Issue Links

          Activity

          Hide
          Eli Collins added a comment -

          This bug exists in FSEditLog#close as well.

          Show
          Eli Collins added a comment - This bug exists in FSEditLog#close as well.
          Hide
          Eli Collins added a comment -

          Patch attached. Depends on HDFS-2701 and HDFS-2703. I verified the NN no longer exits after checkpointing when a single dir failed. I'm working on a unit test for this jira and 2703 but thought I'd put up a fix for review in the meantime.

          Show
          Eli Collins added a comment - Patch attached. Depends on HDFS-2701 and HDFS-2703 . I verified the NN no longer exits after checkpointing when a single dir failed. I'm working on a unit test for this jira and 2703 but thought I'd put up a fix for review in the meantime.
          Hide
          Eli Collins added a comment -

          And verified the NN still exits after all storage directories have failed.

          Show
          Eli Collins added a comment - And verified the NN still exits after all storage directories have failed.
          Hide
          Eli Collins added a comment -

          Updated patch with new test class that covers:
          #1 The NN doesn't exit as long as it has a valid storage dir
          #2 The NN exits when it no longer has a valid storage dir
          #3 Removed storage dirs is updated (fails w/o HDFS-2703)

          Show
          Eli Collins added a comment - Updated patch with new test class that covers: #1 The NN doesn't exit as long as it has a valid storage dir #2 The NN exits when it no longer has a valid storage dir #3 Removed storage dirs is updated (fails w/o HDFS-2703 )
          Hide
          Eli Collins added a comment -

          Slightly updated patch. I made FSEditLog#logEdit throw an AssertionError (rather than just assert) so we stop the NN if there's a bug where we forget to remove an edit stream after we notice a failed directory. This should never fire, but could if we introduced a bug where eg we missed a call to removeEdits. Updated the test to check that we can't log an edit if there are no streams.

          Show
          Eli Collins added a comment - Slightly updated patch. I made FSEditLog#logEdit throw an AssertionError (rather than just assert) so we stop the NN if there's a bug where we forget to remove an edit stream after we notice a failed directory. This should never fire, but could if we introduced a bug where eg we missed a call to removeEdits. Updated the test to check that we can't log an edit if there are no streams.
          Hide
          Todd Lipcon added a comment -
          • in fatalExit, can you change it to:
            FSNamesystem.LOG.faral(msg, new Exception(msg));
            

            so that we get a stacktrace in the logs?

          • in exitIfNoStreams use isEmpty instead of comparing size() == 0
          • rather than an if...throw AssertionError maybe just use the Preconditions.checkState function from guava? Or is guava not in branch-1 yet? (can't remember)
          • instead of calling exitIfNoStreams everywhere, maybe removeEditsForStorageDir can just call it whenever it removes one?

          Otherwise looks good.

          Show
          Todd Lipcon added a comment - in fatalExit , can you change it to: FSNamesystem.LOG.faral(msg, new Exception(msg)); so that we get a stacktrace in the logs? in exitIfNoStreams use isEmpty instead of comparing size() == 0 rather than an if...throw AssertionError maybe just use the Preconditions.checkState function from guava? Or is guava not in branch-1 yet? (can't remember) instead of calling exitIfNoStreams everywhere, maybe removeEditsForStorageDir can just call it whenever it removes one? Otherwise looks good.
          Hide
          Eli Collins added a comment -

          #1-2 Good suggestions. Done.
          #3 Yea, no guava in branch-1.
          #4 That's what causes the bug (on log roll when we close all and add streams we'll exit the NN when we don't mean to). I considered folding the check into the removeEdits and making it conditional (so we don't trigger it in log roll) but there are a number of places where we want to trigger this check where we are not removing edits so it seemed cleaner to always call it explicitly.

          Show
          Eli Collins added a comment - #1-2 Good suggestions. Done. #3 Yea, no guava in branch-1. #4 That's what causes the bug (on log roll when we close all and add streams we'll exit the NN when we don't mean to). I considered folding the check into the removeEdits and making it conditional (so we don't trigger it in log roll) but there are a number of places where we want to trigger this check where we are not removing edits so it seemed cleaner to always call it explicitly.
          Hide
          Todd Lipcon added a comment -

          Oh, right. duh Thanks, +1.

          Show
          Todd Lipcon added a comment - Oh, right. duh Thanks, +1.
          Hide
          Eli Collins added a comment -

          Thanks Todd. Minor update to previous patch. It's not technically part of this change but good to fix at the same time. The question in the current comment in purgeEditLog is valid, let's fix that. Note that we won't and don't want to bail out since we've just closed all the logs.

          -          // Should we also remove from edits
          +          sd.unlock();
          +          removeEditsForStorageDir(sd);
                     fsimage.updateRemovedDirs(sd, null);
                     it.remove();
          
          Show
          Eli Collins added a comment - Thanks Todd. Minor update to previous patch. It's not technically part of this change but good to fix at the same time. The question in the current comment in purgeEditLog is valid, let's fix that. Note that we won't and don't want to bail out since we've just closed all the logs. - // Should we also remove from edits + sd.unlock(); + removeEditsForStorageDir(sd); fsimage.updateRemovedDirs(sd, null ); it.remove();
          Hide
          Todd Lipcon added a comment -

          I think the addendum is correct. I wish we had a fault injection point for each of these cases, but that's a larger amount of work... something like how we use ErrorSimulator for errors in checkpoints would be great.

          Show
          Todd Lipcon added a comment - I think the addendum is correct. I wish we had a fault injection point for each of these cases, but that's a larger amount of work... something like how we use ErrorSimulator for errors in checkpoints would be great.
          Hide
          Eli Collins added a comment -

          Agree, we'd need something like ErrorSimulator to cover this case, given that this is all re-written and tested well in trunk I don't I think for branch-1 it's not worth the energy.

          Show
          Eli Collins added a comment - Agree, we'd need something like ErrorSimulator to cover this case, given that this is all re-written and tested well in trunk I don't I think for branch-1 it's not worth the energy.
          Hide
          Eli Collins added a comment -

          Tests and test-patch are clean. I've committed this.

          Thanks for the review Todd!

          Show
          Eli Collins added a comment - Tests and test-patch are clean. I've committed this. Thanks for the review Todd!

            People

            • Assignee:
              Eli Collins
              Reporter:
              Eli Collins
            • Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development