
Key: HADOOP-4597
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: Konstantin Shvachko
Reporter: Konstantin Shvachko
Votes: 0
Watchers: 2
Hadoop Common

Under-replicated blocks are not calculated if the name-node is forced out of safe-mode.

Created: 06/Nov/08 12:07 AM   Updated: 08/Jul/09 04:43 PM
Component/s: None
Affects Version/s: 0.18.0
Fix Version/s: 0.18.3

Time Tracking:
Not Specified

File Attachments:
NeededRepl-18.patch (text file, 0.5 kB), 2008-11-06 08:01 PM, Konstantin Shvachko, licensed for inclusion in ASF works
NeededRepl.patch (text file, 0.6 kB), 2008-11-06 12:22 AM, Konstantin Shvachko, licensed for inclusion in ASF works
Issue Links:
Reference

Hadoop Flags: Reviewed
Resolution Date: 07/Nov/08 02:13 AM


Description
Currently, during name-node startup, under-replicated blocks are not added to the neededReplications queue until the name-node leaves safe mode. This is an optimization: otherwise all blocks would first go into the under-replicated queue, and most of them would then be removed from it.
When the name-node leaves safe mode automatically, it checks that all blocks have the correct number of replicas (processMisReplicatedBlocks()).
When the name-node leaves safe mode manually, it does not perform this check.
In the latter case all under-replicated blocks remain under-replicated indefinitely, because no other mechanism triggers their replication.
The proposal is to call processMisReplicatedBlocks() whenever the name-node leaves safe mode, whether automatically or manually.
In addition to solving that problem, this could serve as an alternative mechanism for refreshing the neededReplications and excessReplicateMap sets.
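
The difference between the two exit paths can be illustrated with a small toy model. This is only a sketch and not the actual name-node code: the class SafeModeSketch, its fields, the method leaveSafeModeWithoutScan(), and the replication factor of 2 are all assumptions made for illustration. Only the names processMisReplicatedBlocks(), leaveSafeMode() and neededReplications come from this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the name-node's block bookkeeping; not the real implementation. */
class SafeModeSketch {
    // Assumed target replication factor, chosen only for this illustration.
    static final int TARGET_REPLICATION = 2;

    // Hypothetical structure: block id -> number of replicas reported so far.
    private final Map<String, Integer> liveReplicas = new LinkedHashMap<>();
    // Stands in for the neededReplications queue named in the description.
    private final List<String> neededReplications = new ArrayList<>();
    private boolean inSafeMode = true;

    /** Records one reported replica; the queue is deliberately left untouched here. */
    void reportReplica(String blockId) {
        liveReplicas.merge(blockId, 1, Integer::sum);
    }

    /** Scans every known block and queues those below the target replica count. */
    void processMisReplicatedBlocks() {
        neededReplications.clear();
        for (Map.Entry<String, Integer> e : liveReplicas.entrySet()) {
            if (e.getValue() < TARGET_REPLICATION) {
                neededReplications.add(e.getKey());
            }
        }
    }

    /** Proposed behavior: the scan runs on every exit, automatic or manual. */
    void leaveSafeMode() {
        inSafeMode = false;
        processMisReplicatedBlocks();
    }

    /** Models the reported bug: a manual exit that skips the scan. */
    void leaveSafeModeWithoutScan() {
        inSafeMode = false;
    }

    boolean isInSafeMode() {
        return inSafeMode;
    }

    int pendingReplications() {
        return neededReplications.size();
    }

    public static void main(String[] args) {
        SafeModeSketch buggy = new SafeModeSketch();
        buggy.reportReplica("blk_1"); // only one replica of the block is known
        buggy.leaveSafeModeWithoutScan();
        System.out.println("without scan: " + buggy.pendingReplications());

        SafeModeSketch fixed = new SafeModeSketch();
        fixed.reportReplica("blk_1");
        fixed.leaveSafeMode();
        System.out.println("with scan: " + fixed.pendingReplications());
    }
}
```

In this model the first instance reports 0 pending replications even though blk_1 has a single replica, while the second reports 1, which mirrors the behavior described above.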

Hadoop QA added a comment - 06/Nov/08 11:02 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12393414/NeededRepl.patch
against trunk revision 711734.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3543/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3543/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3543/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3543/console

This message is automatically generated.


Konstantin Shvachko added a comment - 06/Nov/08 11:58 PM
I did manual testing, which confirms the change works as expected.
  1. Create a new file system containing a few files by starting name-node and 2 data-nodes, and loading a couple of files into it. Then stop the cluster.
  2. Start name-node with dfs.safemode.threshold.pct = 1.1, so that it never leaves safe mode automatically.
  3. Start one data-node, which contains exactly one copy of each block.
  4. Call dfsadmin -metasave tmp.txt. File tmp.txt will show that there are 0 "Blocks waiting for replication:".
  5. Call dfsadmin -safemode leave. The name-node will leave safe-mode.
  6. Call dfsadmin -metasave tmp.txt. File tmp.txt will show that the number of "Blocks waiting for replication:" > 0,
    and will list all blocks of the file system because they are all under-replicated.

Without the patch the last step would still show "Blocks waiting for replication: 0".
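
The comparison in steps 4 and 6 could be automated by reading the count out of the metasave file. The following is a minimal sketch, assuming the file contains a line of the form quoted above ("Blocks waiting for replication: N"); the class MetasaveCheck and the method blocksWaiting() are hypothetical helpers, and nothing else about the metasave layout is assumed.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts the waiting-block count from metasave text. */
class MetasaveCheck {
    // The label is the one quoted in this comment; the surrounding layout is assumed.
    private static final Pattern WAITING =
            Pattern.compile("Blocks waiting for replication:\\s*(\\d+)");

    /** Returns the count after the label, or empty if the label is not present. */
    static OptionalInt blocksWaiting(String metasaveText) {
        Matcher m = WAITING.matcher(metasaveText);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Sample strings standing in for the contents of tmp.txt.
        String beforeLeave = "Blocks waiting for replication: 0\n";
        String afterLeave = "Blocks waiting for replication: 3\n";
        System.out.println("before: " + blocksWaiting(beforeLeave));
        System.out.println("after: " + blocksWaiting(afterLeave));
    }
}
```

With the patch applied, the value read after step 6 should be greater than the value read after step 4.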


Raghu Angadi added a comment - 07/Nov/08 12:45 AM

Does the call to leaveSafeMode() in checkMode() also need to pass 'true' for the second argument?

Konstantin Shvachko added a comment - 07/Nov/08 01:26 AM
Yes, in that case we will always verify mis-replicated blocks.
I am therefore removing the boolean parameter, since it would always have the same value, true.

Konstantin Shvachko added a comment - 07/Nov/08 01:39 AM
I'll address Raghu's point in a subsequent issue.

Konstantin Shvachko added a comment - 07/Nov/08 02:13 AM
I just committed this.

Hudson added a comment - 07/Nov/08 02:30 PM
Integrated in Hadoop-trunk #654 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/654/)
Calculate mis-replicated blocks when safe-mode is turned off manually. Contributed by Konstantin Shvachko.