Hadoop Common / HADOOP-3184

HOD should gracefully exclude "bad" nodes during ring formation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: contrib/hod
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed
    • Release Note:
      Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by trying to bring them up on another node in the ring. Introduced new property ringmaster.max-master-failures to specify the maximum number of times a master is allowed to fail.
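
      For reference, the new option would presumably be set in the hodrc
      configuration file under the ringmaster section; the section name and
      the value below are illustrative assumptions, not defaults from the
      patch:

        [ringmaster]
        # Abort allocation only after a master (NameNode or JobTracker)
        # has failed to come up this many times (illustrative value).
        max-master-failures = 5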

      Description

      HOD clusters sometimes fail to allocate because of a single "bad" node. During ring formation, the entire ring should not depend on every single node being good. Instead, it should exclude any ring member that does not successfully join the ring within a specified amount of time.

      This is a frequent HOD user issue (although not directly caused by HOD).

      Examples of bad nodes: missing Java, an incorrect version of HOD or Hadoop, a corrupt local name cache, slow network links, drives just beginning to fail, etc.

      Many of these conditions are known, and we can monitor for those separately, but this enhancement would shield users from unknown failure conditions that we haven't yet anticipated. This way, a user will get a cluster, instead of hanging indefinitely.
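
      As a concrete illustration of the idea (a minimal sketch in plain
      Python, not HOD's actual code; has_joined, join_timeout_secs, and
      min_ring_size are hypothetical names):

        import time

        def form_ring(candidates, join_timeout_secs, min_ring_size):
            """Sketch: admit each candidate that joins the ring within the
            timeout, and exclude the rest instead of aborting or hanging."""
            ring, excluded = [], []
            for node in candidates:
                deadline = time.time() + join_timeout_secs
                while time.time() < deadline:
                    if node.has_joined():    # hypothetical membership check
                        ring.append(node)
                        break
                    time.sleep(1)
                else:
                    excluded.append(node)    # "bad" node: skip it, move on
            if len(ring) < min_ring_size:
                raise RuntimeError("too few good nodes to form a ring")
            return ring, excluded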

    Attachments

    1. 3184.1.patch (6 kB) - Hemanth Yamijala
    2. 3184.2.patch (9 kB) - Hemanth Yamijala

        Activity

        Hemanth Yamijala added a comment -

        There are two possible types of issues users are facing:

        • HOD allocations fail (that is, the allocate command returns with a non-zero exit code) due to some of the conditions mentioned above, and retrying doesn't help unless the condition is rectified or the offending node is removed from the resource manager's list. This is particularly true with Torque, as it returns the same set of nodes in the same order, so the failure condition is mostly repeated.
          OR
        • HOD allocation hangs (without returning), again due to some of the conditions mentioned.

        First, can you please confirm which of these is the bigger issue?

        AFAIK, the second case is a Torque issue where we do not even get control to do anything. We could attempt to fix the first one, maybe even outside of HOD: we could offline a node if HOD allocations fail a couple of times on it, so that the offending node is removed in an automated manner and further attempts would work. A sketch of this idea follows.
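
        A minimal sketch of that automation (illustrative Python, not part
        of HOD; the threshold is an assumption, and pbsnodes -o is Torque's
        command for marking a node offline):

          import subprocess

          FAILURE_THRESHOLD = 2    # "a couple of times"; illustrative
          failure_counts = {}

          def record_allocation_failure(node_name):
              """Count allocation failures per node; once a node crosses the
              threshold, mark it offline in Torque so the resource manager
              stops handing it out."""
              failure_counts[node_name] = failure_counts.get(node_name, 0) + 1
              if failure_counts[node_name] >= FAILURE_THRESHOLD:
                  subprocess.run(["pbsnodes", "-o", node_name], check=True)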

        Hemanth Yamijala added a comment -

        Another approach is the following:

        Mostly, HOD allocations fail if the RingMaster does not come up or the JobTracker does not come up. If the JobTracker does not come up, the hodring on that node can report a failure, and another node that asks for the hadoop command can be asked to run the JobTracker. If the RingMaster does not come up, it's a bit more difficult, because the RingMaster is what controls the whole process. So, maybe in that case, the RingMaster should somehow bring up another instance of itself on a different machine and then die gracefully.

        I think the latter change would be quite involved. The former should be simpler.

        Hemanth Yamijala added a comment -

        Patch that addresses the issue of JobTracker or NameNode failure.

        Hemanth Yamijala added a comment -

        The attached patch solves the problem of cluster allocation failing due to a single bad JobTracker node in the entire cluster. It does not handle RingMaster failures, which are much tougher to solve at this point.

        Description of the solution:

        This patch builds on the solution of HADOOP-3464, where we introduced an RPC message (setHodRingErrors) that a HodRing calls when it fails to launch the Hadoop daemons on a node (e.g. because of a missing Hadoop installation). In HADOOP-3464, upon receiving this error, we checked whether it came while launching a master command (i.e. a NameNode or JobTracker command) and, if so, simply propagated it back to the client, which deallocated the cluster after displaying the error message from the hodring.

        In this patch, we keep track of how many times such master commands failed in a variable in the service object. We also introduce a config variable, ringmaster.max-master-failures. The RingMaster returns an error to the client only when the number of master command failures exceeds the configured value. If the number is not exceeded, the next HodRing that asks for a command to launch is given the master command again.

        The value of ringmaster.max-master-failures is bounded by a function of the maximum number of requested nodes, in case the requested nodes are fewer than the configured value. This ensures that cluster allocation can fail once there are no longer enough nodes available to bring up the masters.
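
        A rough sketch of that bookkeeping (illustrative Python, not the
        actual patch code; the class name and the exact bounding rule are
        assumptions):

          class MasterCommandTracker:
              """Track failed launches of a master command (NameNode or
              JobTracker) and decide whether to retry it on another node."""

              def __init__(self, configured_max_failures, requested_nodes):
                  # Bound the configured limit by the number of requested
                  # nodes, so allocation fails once no nodes remain to try.
                  self.max_failures = min(configured_max_failures,
                                          requested_nodes - 1)
                  self.failures = 0

              def on_master_failure(self):
                  """Called when a HodRing reports, via setHodRingErrors,
                  that the master command failed to launch on its node."""
                  self.failures += 1
                  if self.failures > self.max_failures:
                      # Too many bad nodes: propagate the error so the
                      # client deallocates the cluster.
                      return "fail-allocation"
                  # Otherwise, hand the master command to the next HodRing
                  # that asks for a command to launch.
                  return "reissue-master-command"

        For example, with ringmaster.max-master-failures set to 3 on a
        five-node request, up to three master launch failures would be
        tolerated before the allocation is abandoned.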

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12383452/3184.1.patch
        against trunk revision 663487.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2589/console

        This message is automatically generated.

        Mahadev konar added a comment -

        1) shouldRetryMasterLaunch is defined but is not used anywhere.
        2) You might want to wrap most of the statements, since they exceed 80 columns.
        3) Is there something we can report back to the user on the command line when some machines are faulty ("CRITICAL: contact admin")? It would be really helpful if we could do that.

        Hemanth Yamijala added a comment -

        Patch addressing some of Mahadev's comments.

        Hemanth Yamijala added a comment -

        Mahadev, thank you for the review.

        1) shouldRetryMasterLaunch is defined but is not used anywhere.

        This is removed now.

        2) You might want to wrap most of the statements, since they exceed 80 columns.

        Also done.

        3) Is there something we can report back to the user on the command line when some machines are faulty ("CRITICAL: contact admin")? It would be really helpful if we could do that.

        I presume you mean the case where some machines failed but the cluster eventually came up, right? Because in the other case, we do print a report on the command line telling users that the hodring on a given machine failed and for what reason. The services folks could then check the ringmaster log to see what other machines failed.

        If you meant the former (i.e. the case of eventual success), I agree that it would be a useful feature to have. However, it would take more work to build this functionality into the client. I propose we leave this as is for now and make the enhancement in a later release.

        Hemanth Yamijala added a comment -

        I presume you mean the case where some machines failed but the cluster eventually came up, right? Because in the other case, we do print a report on the command line telling users that the hodring on a given machine failed and for what reason. The services folks could then check the ringmaster log to see what other machines failed.

        In an offline conversation I had with Mahadev, I found that he had actually meant the latter, which is supported. So, all is good. The other feature would still be useful, though it can be done as an enhancement at a later stage.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12383534/3184.2.patch
        against trunk revision 663841.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2602/console

        This message is automatically generated.

        Hemanth Yamijala added a comment -

        The core test failure is unrelated to the patch.

        Devaraj Das added a comment -

        I just committed this. Thanks, Hemanth!


          People

          • Assignee: Hemanth Yamijala
          • Reporter: Marco Nicosia
          • Votes: 0
          • Watchers: 0
