[HADOOP-654] jobs fail with some hardware/system failures on a small number of nodes - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.7.2
Fix Version/s: 0.12.0
Component/s: None
Labels: None

Description

Occasionally, such as when the OS runs out of some resource, a node fails only partially. The node is up, and the tasktracker is running and sending heartbeats, but every task fails, for example because the tasktracker cannot fork tasks.
In these cases, that tasktracker keeps being assigned tasks to execute, and they all fail.
With a couple of nodes like that, jobs start failing badly.

The job tracker should avoid assigning tasks to tasktrackers that are misbehaving.

Simple approach: avoid tasktrackers that report many more failures than average (say, 3X), using only the information already sent by the tasktracker (TT).
Better but harder: track TT failures over time and:
1. avoid those that exhibit a high failure rate
2. tell them to shut down
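
The simple approach above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patches: the class `FailureOutlierFilter`, its method `trackersToAvoid`, and the `minFailures` floor are hypothetical names introduced here, and the per-tracker failure counts are assumed to be available to the job tracker as a plain map.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the "3X the average" heuristic.
 * Not taken from the Hadoop code base or from the attached patches.
 */
class FailureOutlierFilter {
    private final double factor;   // e.g. 3.0 for "3X the average"
    private final int minFailures; // assumed floor so a tiny count is never flagged

    FailureOutlierFilter(double factor, int minFailures) {
        this.factor = factor;
        this.minFailures = minFailures;
    }

    /**
     * Returns the names of trackers whose reported failure count is at least
     * minFailures and strictly greater than factor times the mean count
     * across all trackers in the map.
     */
    Set<String> trackersToAvoid(Map<String, Integer> failuresByTracker) {
        Set<String> avoid = new HashSet<>();
        if (failuresByTracker.isEmpty()) {
            return avoid;
        }
        long total = 0;
        for (int failures : failuresByTracker.values()) {
            total += failures;
        }
        double mean = (double) total / failuresByTracker.size();
        for (Map.Entry<String, Integer> entry : failuresByTracker.entrySet()) {
            int failures = entry.getValue();
            if (failures >= minFailures && failures > factor * mean) {
                avoid.add(entry.getKey());
            }
        }
        return avoid;
    }
}
```

Because the mean includes the misbehaving tracker's own count, this version can only flag a tracker when the cluster has more trackers than the factor (with two trackers, no count can exceed 3X the mean). A variant that excludes the candidate from the mean, or that tracks failures over a time window as in the harder approach, would behave differently.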

Attachments

HADOOP-654_20070208_1.patch (07/Feb/07 20:23, 6 kB, Arun Murthy)
HADOOP-654_20070209_2.patch (08/Feb/07 18:53, 10 kB, Arun Murthy)
HADOOP-654_20070220_3.patch (21/Feb/07 01:19, 10 kB, Arun Murthy)
HADOOP-654_20070221_4.patch (21/Feb/07 17:16, 10 kB, Arun Murthy)

People

Assignee: Arun Murthy

Reporter: Yoram Arnon

Votes: 0

Watchers: 0

Dates

Created: 30/Oct/06 18:47

Updated: 08/Jul/09 16:51

Resolved: 22/Feb/07 19:27