(Edit : formatting only)
The scope of the fix is narrowed to the following :
For this to be a straightforward fix, I need to make one policy change: currently, if a block does not have any good replicas left, it is not included in the "neededReplications" list. I think this was done mainly as an "optimization". But a cluster should not have any blocks in this state, and even the name 'neededReplications' implies such blocks should be included. It would be better if I don't need to add another list that needs to be maintained.
The patch is for missing block alerts. A user can monitor this in multiple ways :
Once the alert is noticed, an admin can run 'dfsadmin -metasave' to find out which specific blocks are missing. 'metasave' is improved a bit to list replica info for each block in the 'neededReplications' list, and the line for a missing block contains the word "MISSING". This is a very non-intrusive change, thus fairly safe for backporting. No new state or data structures for NN to track.
1. NamenodeProtocol.getStats() method documentation needs to be updated about the fourth stat that is being reported.
2. DFSAdmin.java - remove space before : in "Missing Blocks (approx) : ". Additionally, is it a good idea to print the number of corrupt blocks, pending replication, scheduled replication and under replicated block counts in the report? Currently what is printed in the dfsadmin report is also printed in the cluster summary on the namenode web page. It may be a good idea to keep both of them consistent.
3. FSNamesystem.java computeReplicationWork() - move the added code block that sets missingBlocksInCurIter, missingBlocksInPrevIter to zero, above the comments preceding it.
Would this change be incompatible because of the change in the output of the dfsadmin report command?
Thanks Suresh. Updated patch includes all the suggestions. 'dfsadmin -report' now prints 3 extra lines, one for each of "Under replicated blocks", "Blocks with corrupt replicas", and "Missing blocks". The last two counts should normally be zero. The first count should be low and should keep going down.
Regarding whether it should be treated as an "incompatible" change: I personally don't think so, but it does not matter either way.
Comments:
Thanks Suresh.
Attached patch fixes both. The new stat for corrupt blocks is not required since it is already there; I didn't see that earlier. I hope this gets marked for 0.20. It is pretty safe. Otherwise, I am pretty sure I will have to backport it again in the near future and duplicate the considerable constant effort associated with a new jira and a commit.
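For anyone wanting to script a check on the three counters that 'dfsadmin -report' is described as printing, here is a minimal sketch. It assumes each counter appears on its own line as "<label>: <number>"; the exact output format is not quoted in this thread, so treat the pattern as a guess to adjust. `BlockReportCheck`, `parseCounters` and `needsAttention` are hypothetical names, not part of Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not Hadoop code): pulls block counters out of report text. */
final class BlockReportCheck {
    // Assumed line shape "<label>: <number>"; only the labels come from the thread.
    private static final Pattern COUNTER = Pattern.compile(
        "^\\s*(Under replicated blocks|Blocks with corrupt replicas|Missing blocks)"
            + "\\s*:\\s*(\\d+)\\s*$",
        Pattern.CASE_INSENSITIVE);

    /** Returns the counters found, keyed by lower-cased label. */
    static Map<String, Long> parseCounters(String reportText) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : reportText.split("\\R")) {
            Matcher m = COUNTER.matcher(line);
            if (m.matches()) {
                counters.put(m.group(1).toLowerCase(Locale.ROOT),
                             Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    /** True when a counter that should normally be zero is non-zero. */
    static boolean needsAttention(Map<String, Long> counters) {
        return counters.getOrDefault("missing blocks", 0L) > 0
            || counters.getOrDefault("blocks with corrupt replicas", 0L) > 0;
    }
}
```

Feeding it the captured report text and alerting when `needsAttention` returns true would cover the "last two counts should be zero" rule; the under replicated count is returned as well so a caller can watch its trend separately.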
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12401076/HADOOP-4103.patch against trunk revision 748861.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 11 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.
+1 Eclipse classpath. The patch retains Eclipse classpath integrity.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/testReport/
This message is automatically generated.
I forgot to run the test again after the changes to patch based on review.
minor fix to a string in the unit test.
If there are no objections, I am planning to commit this to 0.20.
This is a pretty useful feature for admins and is a pretty safe patch. Please let me know if there are concerns.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12401261/HADOOP-4103.patch against trunk revision 749318.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 11 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.
+1 Eclipse classpath. The patch retains Eclipse classpath integrity.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/testReport/
This message is automatically generated.
Failed contrib test is a known issue :
Patch for 0.20 is attached. The trunk patch conflicts with 0.20.
Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/
Based on this background check, there could be further improvements to monitor more alarms over time, as well as to reduce the latency of detection.
This feature will be optional. Scan period could be around a day.
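As a rough illustration only of the "optional, scan about once a day" behaviour described above, one possible shape is sketched below. Everything in it is hypothetical: `OptionalBlockScan`, the `enabled` flag, the period parameter, and the `LongSupplier` standing in for whatever the scan actually counts. None of it is taken from the patch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch (not Hadoop code) of an optional periodic background scan. */
final class OptionalBlockScan {
    private final boolean enabled;            // assumed on/off switch
    private final long periodHours;           // assumed period, e.g. 24 for "around a day"
    private final LongSupplier missingCount;  // stand-in for the real check
    private ScheduledExecutorService scheduler;
    private volatile long lastSeenMissing;

    OptionalBlockScan(boolean enabled, long periodHours, LongSupplier missingCount) {
        this.enabled = enabled;
        this.periodHours = periodHours;
        this.missingCount = missingCount;
    }

    /** Starts the periodic scan; returns false and does nothing when disabled. */
    boolean start() {
        if (!enabled) {
            return false;
        }
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "optional-block-scan");
            t.setDaemon(true); // do not keep the JVM alive for the scan
            return t;
        });
        scheduler.scheduleAtFixedRate(this::scanOnce, 0, periodHours, TimeUnit.HOURS);
        return true;
    }

    /** Runs one scan; returns true when the result should raise an alarm. */
    boolean scanOnce() {
        lastSeenMissing = missingCount.getAsLong();
        return lastSeenMissing > 0;
    }

    long lastSeenMissing() {
        return lastSeenMissing;
    }

    void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }
}
```

A shorter period would reduce detection latency at the cost of more scan work, which is the trade-off the comment above hints at.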