Owen O'Malley added a comment - 06/Nov/07 05:48 PM
I would check for floating point precision problems. If the calculation was done carelessly, you may end up not being able to reach the desired goal of 1.0.
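To illustrate the kind of precision problem meant here, the following is a minimal sketch, not code from Hadoop. The class and method names are hypothetical. It assumes a "careless" calculation that accumulates a per-block fraction, and contrasts it with a single division of integer counters.

```java
// Hypothetical illustration only; none of these names come from Hadoop.
class SafeRatioDemo {
    // Careless: add 1/blockTotal once per block. Rounding error accumulates,
    // so the sum can stay just below 1.0 even when every block is counted.
    static double accumulatedRatio(int blockTotal) {
        double step = 1.0 / blockTotal;
        double ratio = 0.0;
        for (int i = 0; i < blockTotal; i++) {
            ratio += step;
        }
        return ratio;
    }

    // Safer: keep integer counters and divide once. n / n is exactly 1.0.
    static double dividedRatio(long blockSafe, long blockTotal) {
        return blockTotal == 0 ? 1.0 : (double) blockSafe / blockTotal;
    }

    public static void main(String[] args) {
        System.out.println(accumulatedRatio(10) >= 1.0);  // prints "false"
        System.out.println(dividedRatio(10, 10) >= 1.0);  // prints "true"
    }
}
```

With a threshold of 1.0, the accumulated form would never be satisfied for this input, while the divided form is.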
Did you mean to say: "the namenode does NOT turn off safemode automatically"?
Yes, I missed the "not". Thanks for pointing that out.
Robert Chansler made changes - 07/Dec/07 03:00 AM
The name node keeps track of the number of valid blocks it has received while in safe mode. A valid block is a block that belongs to a file. The counter is called blockSafe. The name node does not leave safe mode automatically if the ratio of blockSafe to the total number of valid blocks is less than the threshold.
I see a bug in how this counter is maintained. Before the counter is incremented, the name node checks whether the block is valid, but it does not do this check before the counter is decremented. When a dfs cluster is started, if an early-started data node has stale blocks, the name node asks the data node to delete the stale blocks in its reply to the first block report. If the data node's second block report comes in while the name node is still in safe mode, those blocks are removed from the blocks map, and the blockSafe counter is also decremented even though those blocks are invalid. So the cluster ends up with a blockSafe counter that is smaller than the number of valid blocks in the namenode. If the threshold is set to 1, the cluster will not be able to leave safe mode.
We also had one cluster that didn't come out of safemode.
fsck showed:
Status: HEALTHY
 Total dirs: 270001
 Total files: 1036456
 Total blocks: 1982902 (avg. block size 92619603 B)
 Minimally replicated blocks: 1982902 (100.00001 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0000212
 Missing replicas: 0 (0.0 %)
The filesystem under path '/' is HEALTHY
I attached btrace to obtain the state of Namenode.namesystem.safeMode:
-bash-3.1$ pgrep -f NameNode
30490
-bash-3.1$ btrace -cp hadoop-core.jar 30490 DFSSafeModeTrace.java
entered org.apache.hadoop.dfs.FSNamesystem.isInSafeMode
org.apache.hadoop.dfs.FSNamesystem@3c992fa5
{threshold=1.0, extension=30000, safeReplication=1, reached=0, blockTotal=1982902, blockSafe=1982812,
this$0=org.apache.hadoop.dfs.FSNamesystem@3c992fa5, }
This shows that blockSafe < blockTotal, which supports Hairong's comment above (dfs.safemode.threshold.pct is set to 1.0f). Please note that the cluster did have stale blocks. As shown in the fsck result, the ratio of minimally replicated blocks to the total number of valid blocks is greater than 100%.
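The numbers in the trace can be checked with a short calculation. This is a sketch under assumptions: the class and method names are hypothetical, and the rule "ratio of blockSafe to blockTotal must reach the threshold" is taken from the description earlier in this thread rather than from the FSNamesystem source.

```java
// Hypothetical helper; the values in main are copied from the btrace output above.
class SafeModeShortfall {
    // Number of additional safe blocks needed before
    // blockSafe / blockTotal >= threshold can hold.
    static long blocksStillNeeded(long blockSafe, long blockTotal, double threshold) {
        long required = (long) Math.ceil(threshold * blockTotal);
        return Math.max(0L, required - blockSafe);
    }

    public static void main(String[] args) {
        long blockSafe = 1982812L;
        long blockTotal = 1982902L;
        double threshold = 1.0;
        // prints "90"
        System.out.println(blocksStillNeeded(blockSafe, blockTotal, threshold));
    }
}
```

Since fsck reports every block as minimally replicated, no further block reports would raise the counter, so a shortfall of 90 keeps the threshold of 1.0 out of reach.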
Here is a patch that should fix the bug.
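The patch itself is not reproduced in this thread. The following is only a simplified model of the behaviour described above, with hypothetical class and method names; it ignores replication counts and assumes each block is reported once. It shows how an unchecked decrement drifts the counter, and how applying the same validity check on decrement avoids that.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of the blockSafe counter; not Hadoop code.
class BlockSafeCounter {
    private final Set<Long> validBlocks;     // blocks that belong to a file
    private final boolean checkOnDecrement;  // false models the reported bug
    private int blockSafe = 0;

    BlockSafeCounter(Set<Long> validBlocks, boolean checkOnDecrement) {
        this.validBlocks = validBlocks;
        this.checkOnDecrement = checkOnDecrement;
    }

    // A data node reports a block: only valid blocks are counted.
    void blockReported(long blockId) {
        if (validBlocks.contains(blockId)) {
            blockSafe++;
        }
    }

    // A block is removed: without the check, stale blocks also decrement.
    void blockRemoved(long blockId) {
        if (!checkOnDecrement || validBlocks.contains(blockId)) {
            blockSafe--;
        }
    }

    int getBlockSafe() {
        return blockSafe;
    }

    // Three valid blocks plus one stale block (id 99) that is later deleted.
    static int simulate(boolean checkOnDecrement) {
        Set<Long> valid = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        BlockSafeCounter counter = new BlockSafeCounter(valid, checkOnDecrement);
        for (long id : new long[] {1L, 2L, 3L, 99L}) {
            counter.blockReported(id);
        }
        counter.blockRemoved(99L);
        return counter.getBlockSafe();
    }

    public static void main(String[] args) {
        System.out.println(simulate(false)); // prints "2": counter below the 3 valid blocks
        System.out.println(simulate(true));  // prints "3": counter matches the valid blocks
    }
}
```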
Hairong Kuang made changes - 21/May/08 10:59 PM
+1 the patch looks good
Tsz Wo (Nicholas), SZE made changes - 22/May/08 02:15 AM
Hairong Kuang made changes - 30/May/08 06:03 PM
Hairong Kuang made changes - 30/May/08 06:03 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12382523/safemode.patch
against trunk revision 661771.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2524/testReport/
This message is automatically generated.
The patch is a one-line change that was analyzed and tested on a real cluster. However, a unit test is not trivial to write, so one is not required.
I just committed this.
Integrated in Hadoop-trunk #509 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/509/)
Hairong Kuang made changes - 02/Jun/08 11:34 PM
Integrated in Hadoop-trunk #523 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/523/)
Owen O'Malley made changes - 08/Jul/09 04:42 PM