Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-4208

NameNode could be stuck in SafeMode due to never-created blocks

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.1.2
    • Component/s: namenode
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      In one test case, NameNode allocated a block and then was killed before the client got the addBlock response. After NameNode restarted, it couldn't get out of SafeMode waiting for the block which was never created. In trunk, NameNode can get out of SafeMode since it only counts complete blocks. However branch-1 doesn't have the clear notion of under-constructioned-block in Namenode.

      JIRA HDFS-4212 is to track the never-created-block issue and this JIRA is to fix NameNode in branch-1 so it can get out of SafeMode when never-created-block exists.

      The proposed idea is for SafeMode not to count the zero-sized last block in an under-construction file as part of total blcok count.

      1. HDFS-4208.branch-1.patch
        6 kB
        Brandon Li
      2. HDFS-4208.branch-1.patch
        5 kB
        Brandon Li
      3. HDFS-4208.branch-1.patch
        4 kB
        Brandon Li

        Issue Links

          Activity

          Hide
          Uma Maheswara Rao G added a comment -

          Hi Brandon,

          What about fsync'ed blocks. They may not be in completed state right?

          Regards,
          Uma

          Show
          Uma Maheswara Rao G added a comment - Hi Brandon, What about fsync'ed blocks. They may not be in completed state right? Regards, Uma
          Hide
          Brandon Li added a comment -

          test-patch result:

          -1 overall.  
              +1 @author.  The patch does not contain any @author tags.
              +1 tests included.  The patch appears to include 3 new or modified tests.
              +1 javadoc.  The javadoc tool did not generate any warning messages.
              +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
              -1 findbugs.  The patch appears to introduce 200 new Findbugs (version 2.0.0) warnings.
          

          This patch doesn't introduce new findbugs warnings.

          Show
          Brandon Li added a comment - test-patch result: -1 overall. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 200 new Findbugs (version 2.0.0) warnings. This patch doesn't introduce new findbugs warnings.
          Hide
          Brandon Li added a comment -

          Hi Uma, thanks for reviewing the patch. Yes, fsync'ed blocks may not be in complete state.
          Let me update the JIRA title and description to better clarify the problem.

          Show
          Brandon Li added a comment - Hi Uma, thanks for reviewing the patch. Yes, fsync'ed blocks may not be in complete state. Let me update the JIRA title and description to better clarify the problem.
          Hide
          Suresh Srinivas added a comment -

          Comments:

          1. getCompleteBlocksTotal()
            • Should we name this as getSafeBlockCount()?
            • javadoc could be "Get the total number of COMPLETE blocks in the system. The last block of files that are under construction that are zero length are not counted towards the COMPLETE blocks. This avoids newly created blocks that might "
            • Instead of adding assert, we should count under construction blocks only if the inode is not null and is underconstruction. The current code could have class cast exception or NPE and might results in namenode not coming up, in case some weird errors.
            • We should also add to the LOG.info number of safeblocks and total blocks.
          2. To make the test case more clear, please add safeblock threshold to the configuration.
          Show
          Suresh Srinivas added a comment - Comments: getCompleteBlocksTotal() Should we name this as getSafeBlockCount()? javadoc could be "Get the total number of COMPLETE blocks in the system. The last block of files that are under construction that are zero length are not counted towards the COMPLETE blocks. This avoids newly created blocks that might " Instead of adding assert, we should count under construction blocks only if the inode is not null and is underconstruction. The current code could have class cast exception or NPE and might results in namenode not coming up, in case some weird errors. We should also add to the LOG.info number of safeblocks and total blocks. To make the test case more clear, please add safeblock threshold to the configuration.
          Hide
          Brandon Li added a comment -

          Updated the patch to address Suresh's feedbacks.

          Show
          Brandon Li added a comment - Updated the patch to address Suresh's feedbacks.
          Hide
          Suresh Srinivas added a comment -

          Some minor edits:

          1. javadoc of the getSafeBlockCount() method:
            • "There are times when blocks are allocated by a client but was never used to" -> "There are times when a block is allocated by a client but is never used to"
            • "NameNode might get stuck in safemode waiting" -> "NameNode might get stuck in safemode in subsequent restart waiting"
            • "last blocks of file under construction" -> "last block of file under construction"
          2. In the log - "Number of blocks excluded by SafeMode" -> "Number of blocks excluded in safeblock count "
          Show
          Suresh Srinivas added a comment - Some minor edits: javadoc of the getSafeBlockCount() method: "There are times when blocks are allocated by a client but was never used to" -> "There are times when a block is allocated by a client but is never used to" "NameNode might get stuck in safemode waiting" -> "NameNode might get stuck in safemode in subsequent restart waiting" "last blocks of file under construction" -> "last block of file under construction" In the log - "Number of blocks excluded by SafeMode" -> "Number of blocks excluded in safeblock count "
          Hide
          Suresh Srinivas added a comment -

          With the above changes +1 for the patch.

          Show
          Suresh Srinivas added a comment - With the above changes +1 for the patch.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          Show
          Tsz Wo Nicholas Sze added a comment - +1 patch looks good.
          Hide
          Uma Maheswara Rao G added a comment -

          Thanks a lot for the explanation Brandon. Good catch!
          Latest revision looks great to me as well.

          Some minor comments adding to Suresh points above, please take care of them in next revision.

          • I think we can make getSafeBlockCount as private, as we are not using it anywhere as of now
          • typo: numExculdedBlocks --> numExcludedBlocks
          • I am also having a suspect that, When there are no blocks added yet(just create a file), lease path will be there but blocks will be just 0. So, we are having only null check here, do you think we should have length check as well to avoid AIOBE? I did not check this in detail, please have a test and verify once. [ create a file and don't add any blocks, let's be emtry blocks list then restart NN ]
          • BTW, please add licence header to test file as you are touching that file anyway.

          Regards,
          Uma

          Show
          Uma Maheswara Rao G added a comment - Thanks a lot for the explanation Brandon. Good catch! Latest revision looks great to me as well. Some minor comments adding to Suresh points above, please take care of them in next revision. I think we can make getSafeBlockCount as private, as we are not using it anywhere as of now typo: numExculdedBlocks --> numExcludedBlocks I am also having a suspect that, When there are no blocks added yet(just create a file), lease path will be there but blocks will be just 0. So, we are having only null check here, do you think we should have length check as well to avoid AIOBE? I did not check this in detail, please have a test and verify once. [ create a file and don't add any blocks, let's be emtry blocks list then restart NN ] BTW, please add licence header to test file as you are touching that file anyway. Regards, Uma
          Hide
          Suresh Srinivas added a comment -

          nice catch Uma.

          Show
          Suresh Srinivas added a comment - nice catch Uma.
          Hide
          Brandon Li added a comment -

          Thanks folks for the review. I uploaded the patch addressing the new comments.

          Show
          Brandon Li added a comment - Thanks folks for the review. I uploaded the patch addressing the new comments.
          Hide
          Suresh Srinivas added a comment -

          +1 for the patch.

          Show
          Suresh Srinivas added a comment - +1 for the patch.
          Hide
          Suresh Srinivas added a comment -

          I committed the patch to branch-1 and branch-1.1.

          Thank you Brandon!

          Show
          Suresh Srinivas added a comment - I committed the patch to branch-1 and branch-1.1. Thank you Brandon!
          Hide
          Suresh Srinivas added a comment -

          Looks like I committed the wrong version of the patch. I will revert the change and commit the right version of the patch.

          Show
          Suresh Srinivas added a comment - Looks like I committed the wrong version of the patch. I will revert the change and commit the right version of the patch.
          Hide
          Suresh Srinivas added a comment -

          I committed the correct patch to branch-1.

          Show
          Suresh Srinivas added a comment - I committed the correct patch to branch-1.
          Hide
          Matt Foley added a comment -

          Closed upon successful release of 1.1.2.

          Show
          Matt Foley added a comment - Closed upon successful release of 1.1.2.

            People

            • Assignee:
              Brandon Li
              Reporter:
              Brandon Li
            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development