Ambari / AMBARI-5681

Add Nagios alert if HDFS last checkpoint time exceeds threshold

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: None
    • Labels: None

    Description

      Description: If the Secondary NameNode (SNN) fails to merge edit files for
      any reason, Nagios does not alert on it.

      PROBLEM: If the SNN fails to merge edit files for a long time, for whatever
      reason, the failure goes undetected. The edit files can then grow very large
      and slow down NameNode performance. In some cases this can lead to
      corruption of the NameNode edit files.
      BUSINESS IMPACT: If Nagios does not alert on SNN failures, this will
      eventually cause long downtime for all customers and a possibility of data
      loss.

      STEPS TO REPRODUCE:

      • The SNN fails to merge edit files for any reason.
      • The NameNode edit files grow in size.
      • The edit files become corrupted.

      ACTUAL BEHAVIOR: Nagios doesn't fire a critical alarm
      EXPECTED BEHAVIOR: Nagios should fire a critical alarm
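
      A minimal sketch of the expected check, for illustration only and not the
      code committed for this issue. It assumes the time of the last HDFS
      checkpoint has already been obtained as a Unix timestamp in seconds; the
      function name check_checkpoint_age and both thresholds are hypothetical.
      Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the
      standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_checkpoint_age(last_checkpoint_s, now_s, warn_after_s, crit_after_s):
    """Return (exit_code, message) for the age of the last HDFS checkpoint.

    All arguments are in seconds. The thresholds are hypothetical,
    operator-chosen values, not Ambari defaults.
    """
    # A missing or future timestamp cannot be judged, so report UNKNOWN.
    if last_checkpoint_s is None or last_checkpoint_s > now_s:
        return UNKNOWN, "UNKNOWN: last checkpoint time is unavailable or in the future"

    age_s = now_s - last_checkpoint_s
    if age_s >= crit_after_s:
        return CRITICAL, "CRITICAL: last checkpoint was %d seconds ago" % age_s
    if age_s >= warn_after_s:
        return WARNING, "WARNING: last checkpoint was %d seconds ago" % age_s
    return OK, "OK: last checkpoint was %d seconds ago" % age_s


# Example with made-up numbers: the last checkpoint is 7 hours old,
# warn after 2 hours, go critical after 6 hours.
now_s = 100000
code, message = check_checkpoint_age(now_s - 7 * 3600, now_s, 2 * 3600, 6 * 3600)
print(code, message)
```

      A real Nagios plugin would print the message and exit with the returned
      code, so that a stale checkpoint raises the critical alarm requested above.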

      SUPPORT ANALYSIS: N/A

      Note:

      We need to get this fixed and alert our customers to add the Nagios alarm
      ASAP.

            People

              Assignee: Andrew Onischuk (aonishuk)
              Reporter: Andrew Onischuk (aonishuk)