Hadoop Common / HADOOP-4597

Under-replicated blocks are not calculated if the name-node is forced out of safe-mode.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently, during name-node startup, under-replicated blocks are not added to the neededReplications queue until the name-node leaves safe mode. This is an optimization: otherwise all blocks would first go into the under-replicated queue, and most of them would then be removed from it.
      When the name-node leaves safe mode automatically, it checks that all blocks have the correct number of replicas (processMisReplicatedBlocks()).
      When the name-node is taken out of safe mode manually, it does not perform this check.
      In the latter case all under-replicated blocks remain under-replicated forever, because no other mechanism triggers their replication.
      The proposal is to call processMisReplicatedBlocks() whenever the name-node leaves safe mode, automatically or manually.
      In addition to solving this problem, the call could serve as an alternative mechanism for refreshing the neededReplications and excessReplicateMap sets.
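
      To make the proposal concrete, here is a minimal toy model of the behaviour described above. It is not the actual FSNamesystem code: the class SafeModeSketch, the method reportBlock(), and the plain collections are all hypothetical stand-ins, and only the names processMisReplicatedBlocks(), leaveSafeMode() and neededReplications come from this issue. The point it illustrates is that a single exit path which always rescans the blocks populates the queue whether safe mode ends automatically or manually.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the proposal; a sketch only, not the real name-node code. */
class SafeModeSketch {
    private final int targetReplication;
    // Hypothetical bookkeeping: block id -> number of live replicas reported.
    private final Map<String, Integer> liveReplicas = new HashMap<>();
    // Stand-in for the neededReplications queue mentioned in the issue.
    private final Set<String> neededReplications = new TreeSet<>();
    private boolean inSafeMode = true;

    SafeModeSketch(int targetReplication) {
        this.targetReplication = targetReplication;
    }

    /** Records a block report. While in safe mode the queue is deliberately left alone. */
    void reportBlock(String blockId, int replicas) {
        liveReplicas.put(blockId, replicas);
        if (!inSafeMode) {
            if (replicas < targetReplication) {
                neededReplications.add(blockId);
            } else {
                neededReplications.remove(blockId);
            }
        }
    }

    /** Full rescan of all known blocks, standing in for processMisReplicatedBlocks(). */
    void processMisReplicatedBlocks() {
        neededReplications.clear();
        for (Map.Entry<String, Integer> e : liveReplicas.entrySet()) {
            if (e.getValue() < targetReplication) {
                neededReplications.add(e.getKey());
            }
        }
    }

    /** Single exit path: used for both the automatic and the manual case. */
    void leaveSafeMode() {
        inSafeMode = false;
        processMisReplicatedBlocks();
    }

    Set<String> neededReplications() {
        return neededReplications;
    }

    public static void main(String[] args) {
        SafeModeSketch nn = new SafeModeSketch(2);
        nn.reportBlock("blk_1", 1);
        nn.reportBlock("blk_2", 1);
        System.out.println("in safe mode: " + nn.neededReplications().size());
        nn.leaveSafeMode(); // e.g. triggered by an administrator
        System.out.println("after leave:  " + nn.neededReplications().size());
    }
}
```

      In this model, skipping the rescan inside leaveSafeMode() would leave the set empty until some later block report happened to touch each block, which mirrors the problem reported here.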

      1. NeededRepl-18.patch
        0.5 kB
        Konstantin Shvachko
      2. NeededRepl.patch
        0.6 kB
        Konstantin Shvachko

          Activity

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12393414/NeededRepl.patch
          against trunk revision 711734.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3543/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3543/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3543/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3543/console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          I did manual testing, which confirms that the change works as expected.

          1. Create a new file system containing a few files by starting name-node and 2 data-nodes, and loading a couple of files into it. Then stop the cluster.
          2. Start name-node with dfs.safemode.threshold.pct = 1.1
          3. Start one data-node, which contains exactly one copy of each block.
          4. Call dfsadmin -metasave tmp.txt. File tmp.txt will show "Blocks waiting for replication: 0".
          5. Call dfsadmin -safemode leave. The name-node will leave safe-mode.
          6. Call dfsadmin -metasave tmp.txt. File tmp.txt will show that the number of "Blocks waiting for replication:" > 0,
            and will list all blocks of the file system because they are all under-replicated.

          Without the patch the last step would still show "Blocks waiting for replication: 0".
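
          The comparison in steps 4 and 6 can be automated with a small helper. The sketch below only assumes that the metasave file contains a line of the form quoted above ("Blocks waiting for replication:" followed by a number); the class MetasaveCheck and the method blocksWaiting() are hypothetical names, not part of Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the counter quoted in the manual test. */
class MetasaveCheck {
    // Assumes the label is followed by a non-negative integer on the same line.
    private static final Pattern WAITING =
            Pattern.compile("Blocks waiting for replication:\\s*(\\d+)");

    /** Returns the reported count, or -1 if the text contains no such line. */
    static long blocksWaiting(String metasaveText) {
        Matcher m = WAITING.matcher(metasaveText);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        String before = "Blocks waiting for replication: 0";
        System.out.println(blocksWaiting(before));
    }
}
```

          With the patch applied, the value read after step 6 should be greater than the value read after step 4.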

          Raghu Angadi added a comment -


          Does the call to leaveSafeMode() in checkMode() also need to pass 'true' for second arg?

          Konstantin Shvachko added a comment -

          Yes, in that case we will always verify mis-replicated blocks.
          I am therefore removing the boolean parameter, since it would always have the same value, true.

          Konstantin Shvachko added a comment -

          I'll address Raghu's point in a subsequent issue.

          Konstantin Shvachko added a comment -

          I just committed this.

          Hudson added a comment -

          Integrated in Hadoop-trunk #654 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/654/)
          Calculate mis-replicated blocks when safe-mode is turned off manually. Contributed by Konstantin Shvachko.


            People

            • Assignee:
              Konstantin Shvachko
              Reporter:
              Konstantin Shvachko
            • Votes:
              0
              Watchers:
              2
