
Key: HADOOP-4103
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Raghu Angadi
Reporter: Christian Kunz
Votes: 0
Watchers: 4
Hadoop Common

Alert for missing blocks

Created: 07/Sep/08 07:18 PM   Updated: 08/Jul/09 04:43 PM
Component/s: None
Affects Version/s: 0.17.2
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
  File                          Date                 Author        Size
  HADOOP-4103-branch-20.patch   2009-03-04 03:17 AM  Raghu Angadi  18 kB
  HADOOP-4103.patch             2009-03-02 10:03 PM  Raghu Angadi  22 kB
  HADOOP-4103.patch             2009-02-27 01:00 AM  Raghu Angadi  22 kB
  HADOOP-4103.patch             2009-02-26 01:24 AM  Raghu Angadi  23 kB
  HADOOP-4103.patch             2009-02-20 01:35 AM  Raghu Angadi  19 kB

Hadoop Flags: Incompatible change, Reviewed
Release Note: Modified dfsadmin -report to report under-replicated blocks, blocks with corrupt replicas, and missing blocks.
Resolution Date: 04/Mar/09 03:18 AM


Description
A whole bunch of datanodes were marked dead because network problems caused heartbeat timeouts, although the datanodes themselves were fine.

Many processes started to fail because of the corrupted filesystem.

In order to catch and diagnose such problems faster, the namenode should detect the corruption automatically and provide a way to alert operations. At a minimum, it should show the fact of corruption on the GUI.



Comments
Raghu Angadi added a comment - 25/Nov/08 09:40 PM
I am thinking of implementing a background fsck on the NameNode. This will share/reuse most of the code with the current Fsck. The extra features will make it easy for an admin to quickly check whether there is something odd (e.g. the ability to list the last 100 or so blocks in an inconsistent state).

Based on this background check, there could be further improvements over time, such as monitoring more alarms and reducing the latency of detection.

This feature will be optional. Scan period could be around a day.


Raghu Angadi added a comment - 24/Jan/09 01:20 AM - edited
(Edit : formatting only)

The scope of the fix is narrowed to the following :

  • The NameNode web UI shows a warning (probably in red) indicating whether there are any missing blocks.
    • Will most likely add a Simon stat for this number.
  • 'dfsadmin -metasave' can be used to find all the missing blocks.
    • A later jira will enhance -metasave or add a different command that is more user friendly. Currently -metasave is mainly meant for developers.

For this to be a straightforward fix, I need to make one policy change: currently, if a block does not have any good replicas left, it is not included in the "neededReplications" list. I think this was done mainly as an "optimization". But a cluster should not have any blocks in this state, and even the name 'neededReplications' implies such blocks should be included. It would be better if I don't need to add another list that needs to be maintained.
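The policy change described above can be sketched roughly as follows. This is a simplified, hypothetical model and not the FSNamesystem code: the class name, method names, and block IDs are made up for illustration. It only shows the idea that a block with zero good replicas stays in the same list as other under-replicated blocks, so a missing-block count can be derived without a second list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a "neededReplications"-style list.
// All names here are illustrative assumptions, not Hadoop APIs.
final class NeededReplicationsSketch {
    // Blocks waiting for re-replication, including those with no good replica.
    private final List<String> needed = new ArrayList<>();
    // Number of listed blocks that currently have zero good replicas.
    private int missing = 0;

    /** Adds the block if it has fewer live replicas than expected. */
    boolean add(String blockId, int liveReplicas, int expectedReplicas) {
        if (liveReplicas >= expectedReplicas) {
            return false; // fully replicated, nothing to do
        }
        // Under the old policy a block with liveReplicas == 0 would be skipped;
        // here it is kept in the list and counted as missing.
        needed.add(blockId);
        if (liveReplicas == 0) {
            missing++;
        }
        return true;
    }

    int size() {
        return needed.size();
    }

    int missingCount() {
        return missing;
    }

    public static void main(String[] args) {
        NeededReplicationsSketch q = new NeededReplicationsSketch();
        q.add("blk_1", 3, 3); // healthy, not listed
        q.add("blk_2", 1, 3); // under-replicated
        q.add("blk_3", 0, 3); // missing, still listed
        System.out.println(q.size() + " needed, " + q.missingCount() + " missing");
    }
}
```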


Raghu Angadi added a comment - 20/Feb/09 01:35 AM

The patch for missing block alerts. A user can monitor this in multiple ways :

  1. 'bin/hdfs dfsadmin -report' reports this count.
  2. A warning is displayed in red on the NameNode front page.
  3. A new stat is added (e.g. for Simon).
    • Also added a stat to report size of corrupt replicas map

Once the alert is noticed, an admin can run 'dfsadmin -metasave' to find out which specific blocks are missing. 'metasave' is improved a bit to list replica info for each block in the 'neededReplication' list, and the line for a missing block contains the word "MISSING".
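Since each missing block's line in the metasave output contains the word "MISSING", an admin-side script could filter on that word. The sketch below assumes only that marker; the class name, method name, and sample lines are invented for illustration and do not reproduce the real metasave format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: picks out the lines of a metasave dump that mention
// a missing block. Only the "MISSING" marker comes from the discussion above;
// the sample line layout below is an assumption.
final class MetasaveScanSketch {
    static List<String> missingLines(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (line.contains("MISSING")) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real metasave output.
        List<String> dump = Arrays.asList(
            "blk_100 replicas=2",
            "blk_101 MISSING replicas=0",
            "blk_102 replicas=1");
        for (String line : missingLines(dump)) {
            System.out.println(line);
        }
    }
}
```

In practice the input would be read from the file written by 'dfsadmin -metasave' rather than from an in-memory list.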

This is a very non-intrusive change, thus fairly safe for backporting. No new state or data structures for NN to track.


Suresh Srinivas added a comment - 25/Feb/09 02:06 AM
1. NamenodeProtocol.getStats(): the method documentation needs to be updated to cover the fourth stat that is being reported.
2. DFSAdmin.java: remove the space before : in "Missing Blocks (approx) : ". Additionally, is it a good idea to print the number of corrupt blocks, pending replication, scheduled replication and under replicated block counts in the report? Currently what is printed in the dfsadmin report is also printed in the cluster summary on the namenode web page. It may be a good idea to keep both of them consistent.
3. FSNamesystem.java computeReplicationWork(): move the added code block that sets missingBlocksInCurIter and missingBlocksInPrevIter to zero above the comments preceding it.

Would this change be incompatible because of the change in the output of the dfsadmin report command?


Raghu Angadi added a comment - 26/Feb/09 01:24 AM

Thanks Suresh.

Updated patch includes all the suggestions.

'dfsadmin -report' now prints 3 extra lines, one for each of "Under replicated blocks", "Blocks with corrupt replicas", and "Missing blocks". The last two counts should normally be zero. The first count should be low and should keep going down.

Regarding whether it should be treated as an "incompatible" change: I personally don't think so. But it does not matter either way.
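A monitoring script could check those three counts along the following lines. This is only a sketch: the three labels come from the comment above, but the "label: number" line layout, the class name, and the sample report text are assumptions rather than the actual dfsadmin output.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the three extra report lines. The "label: count"
// layout is an assumption; adjust the pattern to the real output.
final class ReportCheckSketch {
    private static final Pattern COUNT_LINE = Pattern.compile(
        "^[ \\t]*(Under replicated blocks|Blocks with corrupt replicas|Missing blocks)"
            + "[ \\t]*:[ \\t]*(\\d+)[ \\t]*$",
        Pattern.MULTILINE);

    /** Extracts the counts that are present in the report text. */
    static Map<String, Long> parse(String report) {
        Map<String, Long> counts = new LinkedHashMap<>();
        Matcher m = COUNT_LINE.matcher(report);
        while (m.find()) {
            counts.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return counts;
    }

    /** True if either of the counts that should normally be zero is not. */
    static boolean needsAttention(Map<String, Long> counts) {
        return counts.getOrDefault("Blocks with corrupt replicas", 0L) > 0
            || counts.getOrDefault("Missing blocks", 0L) > 0;
    }

    public static void main(String[] args) {
        // Made-up sample text, not captured from a real cluster.
        String report = "Under replicated blocks: 3\n"
            + "Blocks with corrupt replicas: 0\n"
            + "Missing blocks: 2\n";
        System.out.println(needsAttention(parse(report)) ? "ALERT" : "OK");
    }
}
```

The under-replicated count is parsed but not used for alerting here, since a small, shrinking value is expected during normal operation.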


Suresh Srinivas added a comment - 26/Feb/09 09:50 PM
Comments:
  1. DFSAdmin.java: please remove the space before : in the newly introduced output.
  2. NameNodeMetrics.numBlocksCorrupted exposes the same data as FSNamesystemMetrics.corruptReplicaBlocks. I am not sure where the new metrics introduced by this patch should go.

Raghu Angadi added a comment - 27/Feb/09 01:00 AM
Thanks Suresh.

The attached patch fixes both. The new stat for corrupt blocks is not required since it is already there. I didn't see that earlier.


Suresh Srinivas added a comment - 27/Feb/09 02:21 AM
+1 for the patch

Raghu Angadi added a comment - 27/Feb/09 07:13 PM
I hope this gets marked for 0.20. It is pretty safe. Otherwise, I am pretty sure I will have to backport it again in the near future and duplicate the considerable constant effort associated with a new jira and a commit.

Bill Au added a comment - 28/Feb/09 02:35 AM
I think this feature is very useful and would like to see it for 0.20 too.

Hadoop QA added a comment - 01/Mar/09 09:28 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12401076/HADOOP-4103.patch
against trunk revision 748861.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 11 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/26/console

This message is automatically generated.


Raghu Angadi added a comment - 02/Mar/09 08:41 PM
I forgot to run the test again after the changes to patch based on review.

Raghu Angadi added a comment - 02/Mar/09 10:03 PM
Minor fix to a string in the unit test.

Raghu Angadi added a comment - 03/Mar/09 02:05 AM
If there are no objections, I am planning to commit this to 0.20.

This is a pretty useful feature for admins and is a pretty safe patch. Please let me know if there are concerns.


Hadoop QA added a comment - 03/Mar/09 11:48 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12401261/HADOOP-4103.patch
against trunk revision 749318.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 11 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/40/console

This message is automatically generated.


Raghu Angadi added a comment - 04/Mar/09 12:35 AM
The failed contrib test is a known issue: HADOOP-5068

Raghu Angadi added a comment - 04/Mar/09 03:17 AM
The patch for 0.20 is attached. The trunk patch conflicts with 0.20.

Raghu Angadi added a comment - 04/Mar/09 03:18 AM
I just committed this.

Hudson added a comment - 13/Mar/09 03:05 PM