Hadoop Common / HADOOP-3585

Hardware Failure Monitoring in large clusters running Hadoop/HDFS

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: metrics
    • Labels:
      None
    • Environment:

      Linux

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added FailMon as a contrib project for hardware failure monitoring and analysis, under /src/contrib/failmon. Created User Manual and Quick Start Guide.

      Description

      At IBM we are interested in identifying hardware failures on large clusters running Hadoop/HDFS. We are working on a framework that will enable nodes to identify failures of their own hardware using the Hadoop log, the system log, and various OS hardware diagnostic utilities. The implementation details are not yet settled, but a draft of our design is in the attached document failmon.doc. We are very interested in Hadoop and system logs from failed machines; if you have any, you are very welcome to contribute them, as they would be of great value for hardware failure diagnosis.

      More details will follow in a later post.

      Attachments

      1. HADOOP-3585.patch
        82 kB
        Ioannis Koltsidas
      2. HADOOP-3585.patch
        134 kB
        Ioannis Koltsidas
      3. HADOOP-3585.3.patch
        134 kB
        Ioannis Koltsidas
      4. HADOOP-3585.2.patch
        133 kB
        Ioannis Koltsidas
      5. FailMon-standalone.zip
        4.49 MB
        Ioannis Koltsidas
      6. failmon2.pdf
        6.02 MB
        Ioannis Koltsidas
      7. failmon.pdf
        28 kB
        Ioannis Koltsidas
      8. failmon.pdf
        6.02 MB
        Ioannis Koltsidas
      9. FailMon_QuickStart.html
        12 kB
        Ioannis Koltsidas
      10. FailMon_Package_descrip.html
        48 kB
        Ioannis Koltsidas

            People

            • Assignee:
              Ioannis Koltsidas
            • Reporter:
              Ioannis Koltsidas
            • Votes:
              1
            • Watchers:
              13

                Time Tracking

                Estimated: 480h
                Remaining: 480h
                Logged: Not Specified