At IBM we're interested in identifying hardware failures on large clusters running Hadoop/HDFS. We are working on a framework that will enable nodes to identify failures on their hardware using the Hadoop log, the system log and various OS hardware diagnosing utilities. The implementation details are not very clear, but you can see a draft of our design in the attached document. We are pretty interested in Hadoop and system logs from failed machines, so if you are in possession of such, you are very welcome to contribute them; they would be of great value for hardware failure diagnosing.
Some details about our design can be found in the attached document failmon.doc. More details will follow in a later post.