Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Fix Version/s: 0.15.0
Component/s: None
Labels: None
Environment: 1400 node Hadoop cluster
Description
Related to HADOOP-2220, problem introduced in HADOOP-1158
At this scale, hardcoding the fetch-failure threshold to a static number (in this case 3) is never going to work. Although the jobs we are running do load the systems, 3 failures can randomly occur within the lifetime of a map; even fetching the data can generate enough load for that many failures to occur.
We believe the number of tasks and the size of the cluster should be taken into account, and therefore that the ratio between total fetch attempts and total failed attempts should be considered.
Given our experience, a task should be declared "Too many fetch failures" based on:
failures > n (could be 3) && (failures / total attempts) > k% (could be 30-40%)
Basically, the first factor gives some head start to the second; the second factor then takes the cluster size and the task size into account.
Additionally, we could take recency into account, say failures and attempts in the last hour, though we do not want to make the window too small.
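For illustration, here is a minimal sketch of the proposed check. The class, method, and threshold names are hypothetical, not the actual JobTracker/TaskTracker code; a recency window could be layered on top by only counting attempts from, say, the last hour.

    // Hypothetical sketch of the proposed "too many fetch failures" criterion.
    public class FetchFailureCheck {

        // Illustrative thresholds: absolute floor ("n") and failure ratio ("k%").
        private static final int MIN_FAILURES = 3;
        private static final double FAILURE_RATIO = 0.30;

        /**
         * Returns true if a map output should be declared "Too many fetch failures".
         *
         * @param failures      number of failed fetch attempts for this map output
         * @param totalAttempts total fetch attempts (failed + successful) for it
         */
        public static boolean tooManyFetchFailures(int failures, int totalAttempts) {
            if (totalAttempts == 0) {
                return false;
            }
            double ratio = (double) failures / totalAttempts;
            // First factor gives a head start; second scales with cluster and task size.
            return failures > MIN_FAILURES && ratio > FAILURE_RATIO;
        }

        public static void main(String[] args) {
            // On a large cluster, 4 failures out of 1000 attempts no longer kill
            // the map (ratio is only 0.4%).
            System.out.println(tooManyFetchFailures(4, 1000));  // false
            // A genuinely bad map output: 5 failures out of 10 attempts (50%).
            System.out.println(tooManyFetchFailures(5, 10));    // true
        }
    }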
Attachments
Issue Links
- incorporates: HADOOP-2220 Reduce tasks fail too easily because of repeated fetch failures (Closed)
- relates to: HADOOP-2220 Reduce tasks fail too easily because of repeated fetch failures (Closed)