Hadoop Common / HADOOP-3585

Hardware Failure Monitoring in large clusters running Hadoop/HDFS


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: metrics
    • Labels: None
    • Environment: Linux

    • Hadoop Flags: Reviewed
    • Release Note: Added FailMon as a contrib project for hardware failure monitoring and analysis, under /src/contrib/failmon. Created User Manual and Quick Start Guide.

    Description

      At IBM we are interested in identifying hardware failures on large clusters running Hadoop/HDFS. We are working on a framework that will enable nodes to detect failures of their own hardware using the Hadoop log, the system log, and various OS hardware diagnostic utilities. The implementation details are not yet settled, but a draft of our design is in the attached document. We are particularly interested in Hadoop and system logs from failed machines; if you have any, you are very welcome to contribute them, as they would be of great value for hardware failure diagnosis.

      Some details about our design can be found in the attached document failmon.doc. More details will follow in a later post.

      Attachments

        1. FailMon_Package_descrip.html (48 kB, Ioannis Koltsidas)
        2. FailMon_QuickStart.html (12 kB, Ioannis Koltsidas)
        3. failmon.pdf (6.02 MB, Ioannis Koltsidas)
        4. failmon.pdf (28 kB, Ioannis Koltsidas)
        5. failmon2.pdf (6.02 MB, Ioannis Koltsidas)
        6. FailMon-standalone.zip (4.49 MB, Ioannis Koltsidas)
        7. HADOOP-3585.2.patch (133 kB, Ioannis Koltsidas)
        8. HADOOP-3585.3.patch (134 kB, Ioannis Koltsidas)
        9. HADOOP-3585.patch (134 kB, Ioannis Koltsidas)
        10. HADOOP-3585.patch (82 kB, Ioannis Koltsidas)

          People

            Assignee: Ioannis Koltsidas (ikoltsidas)
            Reporter: Ioannis Koltsidas (ikoltsidas)
            Votes: 1
            Watchers: 13

              Time Tracking

              Estimated: 480h
              Remaining: 480h
              Logged: Not Specified
