Hadoop Common: HADOOP-3184

HOD should gracefully exclude "bad" nodes during ring formation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: contrib/hod
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. Introduced new property ringmaster.max-master-failures to specify the maximum number of times a master is allowed to fail.
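The retry behaviour described in the release note can be illustrated with a minimal sketch. This is not the code from the attached patches: the function name `start_master`, the `launch` callback, and the node names are hypothetical, and only the idea of a `ringmaster.max-master-failures` limit comes from the issue. The sketch assumes HOD gives up once the failure count reaches the limit; the issue does not state whether the real comparison is inclusive.

```python
from typing import Callable, Iterable, Optional


def start_master(
    ring_nodes: Iterable[str],
    launch: Callable[[str], bool],
    max_master_failures: int,
) -> Optional[str]:
    """Try to start a master (e.g. NameNode or JobTracker) on successive nodes.

    Returns the node on which the master came up, or None once the number of
    failed launches reaches max_master_failures or the ring is exhausted.
    All names here are illustrative, not taken from the HOD source.
    """
    failures = 0
    for node in ring_nodes:
        if failures >= max_master_failures:
            break  # the configured failure budget is used up
        if launch(node):
            return node
        failures += 1  # treat this node as bad and move on to the next one
    return None


# Hypothetical usage: the first two nodes are "bad", the third is healthy.
bad_nodes = {"node1", "node2"}
chosen = start_master(
    ["node1", "node2", "node3"],
    launch=lambda node: node not in bad_nodes,
    max_master_failures=5,
)
```

With a limit of 5 the master ends up on `node3`; with a limit of 2 the same ring would fail the allocation, because both permitted failures are spent on the bad nodes.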

      Description

      HOD clusters sometimes fail to allocate because of a single "bad" node. Ring formation should not depend on every single node being good. Instead, HOD should exclude any ring member that does not adequately join the ring within a specified amount of time.

      This is a frequent HOD user issue (although not directly caused by HOD).

      Examples of bad nodes: missing Java, an incorrect version of HOD or Hadoop, a corrupt local name cache, slow network links, drives just beginning to fail, etc.

      Many of these conditions are known, and we can monitor for those separately, but this enhancement would shield users from unknown failure conditions that we haven't yet anticipated. This way, a user gets a cluster instead of waiting indefinitely on a hung allocation.
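The exclusion behaviour requested in this description could look roughly like the following sketch. It only illustrates the idea of a join deadline. `form_ring`, `join_times`, and `join_timeout` are hypothetical names, and the fix that was eventually committed (see the release note) addresses master failures rather than a general join timeout.

```python
from typing import Dict, List, Optional, Tuple


def form_ring(
    join_times: Dict[str, Optional[float]],
    join_timeout: float,
) -> Tuple[List[str], List[str]]:
    """Split candidate nodes into ring members and excluded nodes.

    join_times maps each node to the seconds it took to join the ring, or
    None if it never joined. Nodes that joined within join_timeout become
    members; all others are excluded instead of blocking the allocation.
    """
    members: List[str] = []
    excluded: List[str] = []
    for node, elapsed in sorted(join_times.items()):
        if elapsed is not None and elapsed <= join_timeout:
            members.append(node)
        else:
            excluded.append(node)
    return members, excluded


# Hypothetical usage: "b" never joined and "c" was too slow.
members, excluded = form_ring(
    {"a": 1.0, "b": None, "c": 40.0, "d": 30.0},
    join_timeout=30.0,
)
```

The point of the design is that a single slow or silent node ends up in the excluded list, while the remaining nodes still form a usable ring.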

      1. 3184.1.patch
        6 kB
        Hemanth Yamijala
      2. 3184.2.patch
        9 kB
        Hemanth Yamijala

        Activity

        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Robert Chansler made changes -
        Hadoop Flags [Reviewed, Incompatible change] [Incompatible change, Reviewed]
        Release Note (original value): Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. These retries are done a configured number of times per master. The change is incompatible because a new required configuration option is introduced: ringmaster.max-master-failures, which defines the maximum number of times a master is allowed to fail.
        Release Note (new value): Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. Introduced new property ringmaster.max-master-failures to specify the maximum number of times a master is allowed to fail.
        Devaraj Das made changes -
        Resolution Fixed [ 1 ]
        Hadoop Flags [Reviewed, Incompatible change] [Incompatible change, Reviewed]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hemanth Yamijala made changes -
        Release Note Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. These retries are done a configured number of times per master. The change is incompatible because a new required configuration option is introduced: ringmaster.max-master-failures, which defines the maximum number of times a master is allowed to fail.
        Hadoop Flags [Reviewed] [Incompatible change, Reviewed]
        Hemanth Yamijala made changes -
        Release Note Running through Hudson.
        Hemanth Yamijala made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop Flags [Reviewed]
        Release Note Running through Hudson.
        Hemanth Yamijala made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hemanth Yamijala made changes -
        Attachment 3184.2.patch [ 12383534 ]
        Hemanth Yamijala made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hemanth Yamijala made changes -
        Attachment 3184.1.patch [ 12383452 ]
        Hemanth Yamijala made changes -
        Assignee Hemanth Yamijala [ yhemanth ]
        Hemanth Yamijala made changes -
        Field Original Value New Value
        Fix Version/s 0.18.0 [ 12312972 ]
        Marco Nicosia created issue -

          People

          • Assignee:
            Hemanth Yamijala
            Reporter:
            Marco Nicosia
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
