[HADOOP-3585] Hardware Failure Monitoring in large clusters running Hadoop/HDFS - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.19.0
Component/s: metrics
Labels: None
Environment: Linux
Hadoop Flags: Reviewed
Release Note:
Added FailMon as a contrib project for hardware failure monitoring and analysis, under /src/contrib/failmon. Created User Manual and Quick Start Guide.

Description

At IBM we are interested in identifying hardware failures on large clusters running Hadoop/HDFS. We are working on a framework that will enable nodes to identify failures in their own hardware using the Hadoop log, the system log, and various OS hardware diagnostic utilities. The implementation details are not yet settled, but a draft of our design is in the attached document. We are particularly interested in Hadoop and system logs from failed machines; if you have any, you are very welcome to contribute them, as they would be of great value for hardware failure diagnosis.

Some details about our design can be found in the attached document failmon.doc. More details will follow in a later post.
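
To make the log-based part of the idea concrete, here is a minimal sketch of a per-node scanner that flags log lines that look like hardware failures. It is an illustration only and is not taken from FailMon: the pattern table, the category names, the `FailureEvent` record, and the `scan_lines` function are all hypothetical, and a real monitor would also need to query OS diagnostic utilities and handle each log's actual format.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical patterns chosen for illustration; they are NOT FailMon's rules.
FAILURE_PATTERNS = {
    "disk": re.compile(r"I/O error|medium error", re.IGNORECASE),
    "memory": re.compile(r"machine check|ECC error", re.IGNORECASE),
    "network": re.compile(r"link is down", re.IGNORECASE),
}


@dataclass
class FailureEvent:
    """One suspicious log line, tagged with where it came from."""
    source: str    # e.g. "syslog" or "hadoop" (labels assumed for this sketch)
    category: str  # key of the pattern that matched
    line: str      # the matching log line, without its trailing newline


def scan_lines(source: str, lines: Iterable[str]) -> List[FailureEvent]:
    """Return one event for every line that matches a failure pattern."""
    events = []
    for line in lines:
        for category, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                events.append(FailureEvent(source, category, line.rstrip("\n")))
                break  # report each line at most once
    return events


if __name__ == "__main__":
    sample = [
        "kernel: end_request: I/O error, dev sda, sector 1234\n",
        "sshd: session opened for user hadoop\n",
        "kernel: eth0: link is down\n",
    ]
    for event in scan_lines("syslog", sample):
        print(event)
```

The same function could be applied to lines read from the Hadoop log and from the system log, with the resulting events collected centrally for analysis.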

Attachments

FailMon_Package_descrip.html (04/Jul/08 19:49, 48 kB, Ioannis Koltsidas)
FailMon_QuickStart.html (30/Jul/08 22:44, 12 kB, Ioannis Koltsidas)
failmon.pdf (22/Jul/08 05:09, 6.02 MB, Ioannis Koltsidas)
failmon.pdf (17/Jun/08 21:02, 28 kB, Ioannis Koltsidas)
failmon2.pdf (30/Jul/08 22:44, 6.02 MB, Ioannis Koltsidas)
FailMon-standalone.zip (04/Jul/08 19:54, 4.49 MB, Ioannis Koltsidas)
HADOOP-3585.2.patch (07/Aug/08 23:03, 133 kB, Ioannis Koltsidas)
HADOOP-3585.3.patch (11/Aug/08 20:49, 134 kB, Ioannis Koltsidas)
HADOOP-3585.patch (30/Jul/08 22:42, 134 kB, Ioannis Koltsidas)
HADOOP-3585.patch (04/Jul/08 19:56, 82 kB, Ioannis Koltsidas)

Issue Links

is related to: HADOOP-3964 javadoc warnings by failmon (Closed)

People

Assignee: Ioannis Koltsidas
Reporter: Ioannis Koltsidas
Votes: 1
Watchers: 13

Dates

Created: 17/Jun/08 21:00
Updated: 20/Nov/08 23:38
Resolved: 15/Aug/08 09:04

Time Tracking

Estimated: 480h
Remaining: 480h
