Hadoop Common / HADOOP-3184

HOD should gracefully exclude "bad" nodes during ring formation

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: contrib/hod
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed
    • Release Note: Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. Introduced new property ringmaster.max-master-failures to specify the maximum number of times a master is allowed to fail.
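
The retry behavior in the release note can be sketched roughly as follows. This is an illustration only, not the code from the attached patches: the names `start_master`, `launch`, and `MasterStartError` are hypothetical, and the sketch assumes that allocation is abandoned once the failure count reaches the configured limit.

```python
class MasterStartError(Exception):
    """Raised when no node in the ring could host the master (hypothetical name)."""


def start_master(nodes, launch, max_master_failures):
    """Try to bring up a master (e.g. NameNode or JobTracker) on successive nodes.

    nodes: candidate nodes in the ring, in the order they should be tried.
    launch: callable taking a node and returning True if the master started.
    max_master_failures: stand-in for the ringmaster.max-master-failures
        setting; assumed here to mean "give up after this many failures".
    Returns (node_that_hosts_master, number_of_failures_seen).
    """
    failures = 0
    for node in nodes:
        if launch(node):
            return node, failures
        failures += 1
        if failures >= max_master_failures:
            break  # too many bad nodes; stop retrying
    raise MasterStartError(
        "master failed to start after %d failure(s)" % failures
    )
```

With a limit greater than the number of bad nodes, the master lands on the first healthy node; otherwise the allocation fails quickly instead of hanging.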

    Description

      HOD clusters sometimes fail to allocate because of a single "bad" node. Ring formation should not depend on every single node being good. Instead, HOD should exclude any ring member that does not successfully join the ring within a specified amount of time.
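
A minimal sketch of the requested behavior, assuming each node either reports a join time or never reports at all. The function name `form_ring` and its parameters are hypothetical and are not taken from HOD's code.

```python
def form_ring(join_times, timeout_secs):
    """Split candidate nodes into ring members and excluded nodes.

    join_times: dict mapping node name -> seconds the node took to join
        the ring, or None if it never joined.
    timeout_secs: how long ring formation waits for a node (assumed knob).
    Returns (members, excluded), each a list in the input order.
    """
    members, excluded = [], []
    for node, joined_at in join_times.items():
        if joined_at is not None and joined_at <= timeout_secs:
            members.append(node)
        else:
            # Node was too slow or never responded: leave it out of the
            # ring rather than blocking the whole allocation.
            excluded.append(node)
    return members, excluded
```

For example, `form_ring({"a": 1.0, "b": None, "c": 40.0}, 30)` keeps `a` and excludes `b` and `c`, so the user still gets a (smaller) cluster.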

      This is a frequent HOD user issue (although not directly caused by HOD).

      Examples of bad nodes: missing Java, an incorrect version of HOD or Hadoop, a corrupt local name cache, slow network links, drives just beginning to fail, etc.

      Many of these conditions are known, and we can monitor for them separately, but this enhancement would shield users from failure conditions that we haven't yet anticipated. This way, a user gets a cluster instead of an allocation request that hangs indefinitely.

      Attachments

        1. 3184.1.patch
          6 kB
          Hemanth Yamijala
        2. 3184.2.patch
          9 kB
          Hemanth Yamijala

          People

            Assignee: Hemanth Yamijala (yhemanth)
            Reporter: Marco Nicosia (menicosia)