
Key: HADOOP-2159
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Hairong Kuang
Reporter: Christian Kunz
Votes: 1
Watchers: 1
Hadoop Common

Namenode stuck in safemode

Created: 06/Nov/07 06:34 AM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: 0.16.0
Fix Version/s: 0.17.1

Time Tracking:
Not Specified

File Attachments:
safemode.patch (text file, 0.7 kB, licensed for inclusion in ASF works), attached 2008-05-21 10:59 PM by Hairong Kuang
Issue Links:
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 02/Jun/08 11:34 PM


Description
Occasionally (not easy to reproduce) the namenode does turn off safemode automatically, although fsck does not report any missing or under-replicated blocks (safemode threshold set to 1.0).

At this moment I do not have any additional information which could help analyze the issue.



Comments
Owen O'Malley added a comment - 06/Nov/07 05:48 PM
I would check for floating point precision problems. If the calculation was done carelessly, you may end up not being able to reach the desired goal of 1.0.
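
To illustrate the kind of precision problem suggested here, the following is a minimal sketch and not the actual name node code. The class and method names (`SafeModeRatio`, `reachedByAccumulation`, `reachedByCounters`) are hypothetical. It contrasts a careless calculation, which accumulates each block's fractional share, with one that keeps integer counters and divides once.

```java
// Hypothetical illustration only; none of these names come from Hadoop.
class SafeModeRatio {

    // Careless variant: add up each block's share of the total.
    // Rounding errors accumulate, so the sum can stay just below 1.0.
    static boolean reachedByAccumulation(int blockTotal, double threshold) {
        double share = 1.0 / blockTotal;
        double ratio = 0.0;
        for (int i = 0; i < blockTotal; i++) {
            ratio += share;
        }
        return ratio >= threshold;
    }

    // Safer variant: keep integer counters and divide exactly once.
    static boolean reachedByCounters(int blockSafe, int blockTotal, double threshold) {
        return (double) blockSafe / blockTotal >= threshold;
    }

    public static void main(String[] args) {
        // Ten blocks, all reported, threshold 1.0.
        System.out.println("accumulation: " + reachedByAccumulation(10, 1.0));
        System.out.println("counters:     " + reachedByCounters(10, 10, 1.0));
    }
}
```

With ten blocks, the accumulated sum of ten 0.1 shares falls slightly short of 1.0 in double arithmetic, so the first check fails even though every block was counted, while the single division yields exactly 1.0.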

Konstantin Shvachko added a comment - 06/Nov/07 06:04 PM
Did you mean to say:
the namenode does NOT turn off safemode automatically

Christian Kunz added a comment - 06/Nov/07 06:32 PM
Yes, I missed the "not". Thanks for pointing that out.

Robert Chansler made changes - 07/Dec/07 03:00 AM
Field Original Value New Value
Link This issue is related to HADOOP-2373 [ HADOOP-2373 ]
Hairong Kuang added a comment - 21/May/08 10:33 PM
The name node keeps track of the total number of valid blocks it has received in safe mode. A valid block is a block that belongs to a file. The counter is called blockSafe. The name node does not leave safe mode automatically if the ratio of blockSafe to the total number of valid blocks is less than the threshold.

I see a bug in maintaining this counter. Before the counter is incremented, the name node checks whether the block is valid. But it does not do the same check before the counter is decremented.

When a dfs cluster is started, if an early-started data node has stale blocks, the name node will ask the data node to delete the stale blocks in the reply to its first block report. If its second block report comes in while the name node is still in safe mode, those blocks will be removed from the blocks map, and the blockSafe counter will also be decremented even though those blocks are invalid. So the cluster will end up with a blockSafe counter that is smaller than the number of valid blocks in the name node. If the threshold is set to 1, the cluster will not be able to leave safe mode.
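
The asymmetry described above can be sketched as follows. This is a simplified model under stated assumptions, not the code in FSNamesystem.java: the class `SafeBlockCounter`, its methods, and the `checkOnDecrement` switch are all hypothetical, and each block is assumed to have a single replica.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the blockSafe bookkeeping; not Hadoop code.
class SafeBlockCounter {
    private final Set<Long> validBlocks;    // blocks that belong to a file
    private final boolean checkOnDecrement; // false models the bug, true the fix
    private int blockSafe = 0;

    SafeBlockCounter(Set<Long> validBlocks, boolean checkOnDecrement) {
        this.validBlocks = new HashSet<>(validBlocks);
        this.checkOnDecrement = checkOnDecrement;
    }

    // A block shows up in a block report: count it only if it is valid.
    void blockReported(long blockId) {
        if (validBlocks.contains(blockId)) {
            blockSafe++;
        }
    }

    // A block replica is removed: the buggy variant decrements unconditionally.
    void blockRemoved(long blockId) {
        if (!checkOnDecrement || validBlocks.contains(blockId)) {
            blockSafe--;
        }
    }

    int getBlockSafe() {
        return blockSafe;
    }

    // Three valid blocks plus one stale block (id 99) that is later deleted.
    static int simulate(boolean checkOnDecrement) {
        Set<Long> valid = new HashSet<>();
        valid.add(1L);
        valid.add(2L);
        valid.add(3L);
        SafeBlockCounter counter = new SafeBlockCounter(valid, checkOnDecrement);
        for (long id : new long[] {1L, 2L, 3L, 99L}) {
            counter.blockReported(id); // first block report
        }
        counter.blockRemoved(99L);     // stale block gone by the second report
        return counter.getBlockSafe();
    }

    public static void main(String[] args) {
        System.out.println("without check on decrement: " + simulate(false));
        System.out.println("with check on decrement:    " + simulate(true));
    }
}
```

In this model the unchecked decrement leaves the counter at 2 for 3 valid blocks, so a threshold of 1.0 can never be met, while the checked decrement keeps it at 3.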


Koji Noguchi added a comment - 21/May/08 10:51 PM
We also had one cluster that didn't come out of safemode.

fsck showed:

Status: HEALTHY
 Total dirs:    270001
 Total files:   1036456
 Total blocks:  1982902 (avg. block size 92619603 B)
 Minimally replicated blocks:   1982902 (100.00001 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0000212
 Missing replicas:              0 (0.0 %)
The filesystem under path '/' is HEALTHY

I attached btrace to obtain the state of Namenode.namesystem.safeMode:

-bash-3.1$ pgrep -f NameNode
30490

-bash-3.1$ btrace -cp hadoop-core.jar 30490 DFSSafeModeTrace.java
entered org.apache.hadoop.dfs.FSNamesystem.isInSafeMode

org.apache.hadoop.dfs.FSNamesystem@3c992fa5
{threshold=1.0, extension=30000, safeReplication=1, reached=0, blockTotal=1982902, blockSafe=1982812,
this$0=org.apache.hadoop.dfs.FSNamesystem@3c992fa5, }

This shows that blockSafe < blockTotal, which supports Hairong's comment above.

(dfs.safemode.threshold.pct is set to 1.0f)
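
As a quick check of the numbers in the trace, the sketch below plugs the reported blockSafe and blockTotal values into a ratio test. The class and method names are hypothetical, and the comparison is an assumption about how the threshold is applied, based on the description earlier in this issue.

```java
// Hypothetical helper for checking the reported counters; not Hadoop code.
class SafeModeCheck {

    // Number of blocks the counter is short of the total.
    static long missing(long blockSafe, long blockTotal) {
        return blockTotal - blockSafe;
    }

    // Assumed form of the test: ratio of safe blocks to total blocks vs. threshold.
    static boolean thresholdReached(long blockSafe, long blockTotal, double threshold) {
        return (double) blockSafe / blockTotal >= threshold;
    }

    public static void main(String[] args) {
        long blockSafe = 1982812L;  // from the btrace output
        long blockTotal = 1982902L; // from the btrace output
        System.out.println("short by: " + missing(blockSafe, blockTotal));
        System.out.println("reached:  " + thresholdReached(blockSafe, blockTotal, 1.0));
    }
}
```

The counter is 90 blocks short of the total, so with a threshold of 1.0 the ratio test cannot pass no matter how many further valid blocks are reported.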


Hairong Kuang added a comment - 21/May/08 10:55 PM - edited
Please note that the cluster did have stale blocks. As shown in the fsck result, the ratio of the minimally replicated blocks to the total number of valid blocks is greater than 100%.

Hairong Kuang added a comment - 21/May/08 10:59 PM
Here is a patch that should fix the bug.

Hairong Kuang made changes - 21/May/08 10:59 PM
Attachment safemode.patch [ 12382523 ]
Tsz Wo (Nicholas), SZE added a comment - 22/May/08 02:15 AM
+1 the patch looks good

Tsz Wo (Nicholas), SZE made changes - 22/May/08 02:15 AM
Assignee Hairong Kuang [ hairong ]
Hadoop Flags [Reviewed]
Status Open [ 1 ] Patch Available [ 10002 ]
Hairong Kuang made changes - 30/May/08 06:03 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Hairong Kuang made changes - 30/May/08 06:03 PM
Fix Version/s 0.17.1 [ 12313190 ]
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 30/May/08 08:06 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12382523/safemode.patch
against trunk revision 661771.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2524/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2524/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2524/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2524/console

This message is automatically generated.


Repository Revision Date User Message
ASF #661912 Fri May 30 23:29:25 UTC 2008 hairong HADOOP-2159. Name node should not decrment blockSafe for invalid blocks. Contributed by Hairong Kuang.
Files Changed
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/FSNamesystem.java
MODIFY /hadoop/core/trunk/CHANGES.txt

Repository Revision Date User Message
ASF #661914 Fri May 30 23:54:37 UTC 2008 hairong Merge -r 661911:661912 from trunk to main to move the change log of HADOOP-2159 into the release 0.17.1 section
Files Changed
MODIFY /hadoop/core/branches/branch-0.17/CHANGES.txt
MODIFY /hadoop/core/branches/branch-0.17/src/java/org/apache/hadoop/dfs/FSNamesystem.java

Hairong Kuang added a comment - 30/May/08 11:58 PM
The patch is a one-line change that was analyzed and tested on a real cluster. A unit test is not trivial to write, so none is included.

I just committed this.


Hudson added a comment - 01/Jun/08 01:49 PM

Hairong Kuang made changes - 02/Jun/08 11:34 PM
Status Patch Available [ 10002 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]
Repository Revision Date User Message
ASF #669333 Wed Jun 18 23:31:04 UTC 2008 acmurthy Reflecting that HADOOP-2159, HADOOP-3472, HADOOP-3442, HADOOP-3477, HADOOP-3475, HADOOP-3550 & HADOOP-3526 have been merged to BRANCH-0.17
Files Changed
MODIFY /hadoop/core/trunk/CHANGES.txt

Repository Revision Date User Message
ASF #669335 Wed Jun 18 23:44:17 UTC 2008 acmurthy Merge -r 669332:669333 from trunk to BRANCH-0.17 to fix CHANGES.txt for HADOOP-2159, HADOOP-3472, HADOOP-3442, HADOOP-3477, HADOOP-3475, HADOOP-3550 & HADOOP-3526
Files Changed
MODIFY /hadoop/core/branches/branch-0.17/CHANGES.txt

Hudson added a comment - 19/Jun/08 12:35 PM

Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]