Hadoop Map/Reduce / MAPREDUCE-1800

using map output fetch failures to blacklist nodes is problematic

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      If a mapper and a reducer cannot communicate, then either party could be at fault. The current Hadoop protocol allows reducers to declare the node running the mapper as being at fault. When a sufficient number of reducers do so, the map node can be blacklisted.

      In cases where networking problems cause substantial degradation in communication across sets of nodes, a large number of nodes can become blacklisted as a result of this protocol. The blacklisting is often wrong (reducers on the smaller side of the network partition can collectively cause nodes on the larger side of the partition to be blacklisted) and counterproductive (rerunning maps puts further load on the already maxed-out network links).

      We should revisit how we can better identify nodes with genuine network problems (and what role, if any, map-output fetch failures have in this).
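
      To make the protocol concrete, here is a minimal sketch of the kind of threshold-based counting described above. The class, method names, and threshold are illustrative only, not the actual JobTracker code.

      import java.util.HashMap;
      import java.util.HashSet;
      import java.util.Map;
      import java.util.Set;

      // Hypothetical sketch of reducer-reported fetch-failure counting.
      // Names and the threshold are illustrative, not the real implementation.
      public class FetchFailureTracker {
          // map host -> set of reducer hosts that reported failed fetches from it
          private final Map<String, Set<String>> reportsByMapHost = new HashMap<>();
          private final int blacklistThreshold;

          public FetchFailureTracker(int blacklistThreshold) {
              this.blacklistThreshold = blacklistThreshold;
          }

          /** A reducer on reducerHost failed to fetch map output from mapHost. */
          public boolean shouldBlacklist(String mapHost, String reducerHost) {
              reportsByMapHost
                  .computeIfAbsent(mapHost, h -> new HashSet<>())
                  .add(reducerHost);
              // Once enough distinct reducers blame the same map host, it is
              // blacklisted, even if the reducers' side of the network is the
              // real problem, which is exactly the failure mode described above.
              return reportsByMapHost.get(mapHost).size() >= blacklistThreshold;
          }
      }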

        Issue Links

          Activity

          tlipcon Todd Lipcon added a comment -

          Hey Joydeep. Do you often have cases where sets of TT nodes can't talk to each other but both sides can still talk to the JT? This is interesting, as it seems like an unusual network architecture.

          jsensarma Joydeep Sen Sarma added a comment -

          If there is a total network partition, then we don't have a problem: either the cluster will fail outright (say the JT and NN end up on different sides of the partition), or one partition (the one that has the JT/NN) will exclude nodes from the other. (I say we don't have a problem in the sense that the response of Hadoop to such an event is more or less correct.)

          The problem is that we have had occurrences of slow networks that are not quite partitioned. For example, the uplink from one rack switch to the core switch can be flaky/degraded. In this case, control traffic from the JT to the TTs may be going through, but data traffic from mappers and reducers on the degraded racks can be really hurt. If there are problems in the core switch itself (it's underprovisioned), then the whole cluster is having network problems. The description applies to such scenarios.

          In such a case, the appropriate response of the software should be, at worst, degraded performance (in keeping with the degraded nature of the underlying hardware) or, at best, correctly identifying the slow node(s) and not using them, or using them less (this would apply to the flaky rack uplink scenario). The current response of Hadoop is neither. It makes a bad situation worse by misassigning blame (when map nodes on good racks are blamed by a sufficiently large number of reducers running on bad racks). We potentially lose nodes from good racks, and the resultant retry of tasks puts further stress on the strained network resource.

          A couple of things seem desirable:
          1. For enterprise data center environments that (may) have a high degree of control and monitoring around their networking elements: the ability to selectively turn off the functionality in Hadoop that tries to detect and correct for network problems. Diagnostics stand a much better chance of catching/identifying networking problems and fixing them.
          2. In environments with less control (say Amazon EC2, or Hadoop running on a bunch of PCs across a company) that are more akin to a p2p network: Hadoop's network fault diagnosis algorithms need improvement. A comparison to BitTorrent is fair - over there every node advertises its upload/download throughput, and a node can come across as slow only in comparison to the collective stats published by all peers (and not just based on communication with a small set of other peers).
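
          A minimal sketch of the "compare against collective stats" idea in point 2; the class, method, and cutoff fraction are made up for illustration and are not an existing Hadoop API:

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.List;
          import java.util.Map;

          // Illustrative sketch: a node is only considered slow relative to the
          // throughput advertised by all peers, not on the word of a few reducers.
          public class RelativeSlownessDetector {

              /** Returns hosts whose advertised throughput is far below the cluster median. */
              public static List<String> findSlowHosts(Map<String, Double> throughputByHost,
                                                       double fractionOfMedian) {
                  List<Double> all = new ArrayList<>(throughputByHost.values());
                  Collections.sort(all);
                  double median = all.get(all.size() / 2);

                  List<String> slow = new ArrayList<>();
                  for (Map.Entry<String, Double> e : throughputByHost.entrySet()) {
                      // Flag a host only if it is well below what the collective reports.
                      if (e.getValue() < median * fractionOfMedian) {
                          slow.add(e.getKey());
                      }
                  }
                  return slow;
              }
          }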

          tlipcon Todd Lipcon added a comment -

          Hey Joydeep. Thanks for the further explanation - I agree we could do better here. There's an old JIRA where we threw around some ideas similar to this maybe last August or so, but I can't seem to find it at the moment. Anyone remember the one I mean?

          tlipcon Todd Lipcon added a comment -

          Found the one I was thinking of - MAPREDUCE-562. It's a bit different, but worth referring back to that conversation.

          acmurthy Arun C Murthy added a comment -

          FWIW the current heuristics protect reduces against the common case of a single bad node (the one on which the map ran), and work reasonably well.

          What I'm reading here is that we need better overall metrics/monitoring of the cluster and enhancements to the masters (JobTracker/NameNode) to take advantage of the metrics/monitoring stats. Is that reasonable?

          jsensarma Joydeep Sen Sarma added a comment -

          The problem is that the current heuristics also cause bad behavior when uplinks/core switches degrade.

          I agree that the case of a single node that is not able to send map outputs is something that Hadoop should detect/correct automatically - but I don't think the current heuristic (by itself) is a good one, because of the previous point.

          I don't have good alternative solutions/proposals. A few thoughts pop to mind:

          • Separate the blacklisting of TTs due to map/reduce task failures from blacklisting due to map-output fetch failures. The thresholds and policies required seem different.
          • If the scope of the fault is NIC/port/process/OS problems affecting a 'single' node, then we should only take into account map-fetch failures that happen within the same rack (i.e. assign blame to a TT only if other TTs within the same rack cannot communicate with it).
          • Blame should be laid by a multitude of different hosts. It's no good if 4 reducers on TT1 cannot get map outputs from TT2 and this results in the blacklisting of TT2. It's possible that TT1 itself has a bad port/NIC.

          (Just thinking aloud; I don't have a careful understanding of the code beyond what's been relayed to me by others.)
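
          As a rough illustration of the distinct-host/same-rack ideas above (the rack map, names, and threshold are hypothetical, not Hadoop code):

          import java.util.HashMap;
          import java.util.HashSet;
          import java.util.Map;
          import java.util.Set;

          // Sketch: blame must come from several distinct hosts, and only from
          // hosts in the same rack as the accused TT, so a degraded uplink on
          // the accuser's rack cannot blacklist healthy remote nodes.
          public class RackScopedBlame {
              private final Map<String, String> rackByHost;            // host -> rack
              private final Map<String, Set<String>> accusersByHost = new HashMap<>();
              private final int minDistinctAccusers;

              public RackScopedBlame(Map<String, String> rackByHost, int minDistinctAccusers) {
                  this.rackByHost = rackByHost;
                  this.minDistinctAccusers = minDistinctAccusers;
              }

              /** reducerHost could not fetch from mapHost; returns true if mapHost looks faulty. */
              public boolean accuse(String mapHost, String reducerHost) {
                  if (!rackByHost.getOrDefault(reducerHost, "?").equals(rackByHost.get(mapHost))) {
                      return false;   // ignore cross-rack accusations
                  }
                  Set<String> accusers =
                      accusersByHost.computeIfAbsent(mapHost, h -> new HashSet<>());
                  accusers.add(reducerHost);
                  return accusers.size() >= minDistinctAccusers;
              }
          }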

          rvadali Ramkumar Vadali added a comment -

          From my understanding, a map output fetch is an HTTP GET. I agree that TCP-level network errors are not good indicators to use for blacklisting, since it is not possible to distinguish between a server-side error and a network error. But HTTP-level errors, especially HTTP 5xx errors (used for server-side errors), should be used for blacklisting. Disk failures that prevent the HTTP server from reading a file would fall in this category. It is possible that such errors could be detected via map failures, but an HTTP 5xx error is a fairly reliable indicator of error.

          So in my opinion we should retain the blacklisting, but make it smarter to use HTTP-level error information.
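
          A minimal sketch of that distinction, assuming a plain HttpURLConnection fetch of the map output URL; the classifier and its names are illustrative, not the actual shuffle code:

          import java.io.IOException;
          import java.net.HttpURLConnection;
          import java.net.URL;

          // Only failures where the server actually answered with a 5xx status are
          // treated as evidence against the serving node; connection-level (TCP)
          // failures stay ambiguous.
          public class FetchFailureClassifier {

              enum Verdict { SERVER_FAULT, AMBIGUOUS_NETWORK, OK }

              public static Verdict classify(URL mapOutputUrl) {
                  try {
                      HttpURLConnection conn = (HttpURLConnection) mapOutputUrl.openConnection();
                      conn.setConnectTimeout(30_000);
                      conn.setReadTimeout(30_000);
                      int status = conn.getResponseCode();
                      if (status >= 500) {
                          return Verdict.SERVER_FAULT;   // e.g. disk failure behind the HTTP server
                      }
                      return Verdict.OK;
                  } catch (IOException e) {
                      // TCP-level error: could be either side, or the network in between.
                      return Verdict.AMBIGUOUS_NETWORK;
                  }
              }
          }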

          jsensarma Joydeep Sen Sarma added a comment -

          That makes a lot of sense. I think most of the failures we see are TCP errors - so yeah, using just HTTP application-level error codes would help. Note that a NIC/port failure will not produce HTTP 5xx errors, so in order to cover that case (which may be common as well) we would have to incorporate network failures into fault diagnosis as well.

          acmurthy Arun C Murthy added a comment -

          "So in my opinion we should retain the blacklisting, but make it smarter to use HTTP-level error information."

          +1

          That is precisely what I had in mind when I said: "better overall metrics/monitoring of the cluster and enhancements to the masters (JobTracker/NameNode) to take advantage of the metrics/monitoring stats".

          jsensarma Joydeep Sen Sarma added a comment -

          How would we detect a slow port/NIC (something that map-fetch failures at the network level currently end up catching)?

          jsensarma Joydeep Sen Sarma added a comment -

          BTW, we don't need distributed failure detection to cover the case of application errors while fetching map outputs. If the map-side task tracker encounters enough failures while serving map outputs, it can either commit suicide or report this fact directly to the JT (instead of relying on the reducer to do so). In that sense, such application errors are no different from errors while trying to execute map/reduce tasks.

          It seems that the only non-trivial cases that the reducer needs to report are network error cases, which are inherently symmetric in nature. The onus then shifts to the JT to infer which party is to blame (if any) by looking at the collective set of errors being reported in the system.
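
          A small sketch of the self-reporting idea, assuming a hypothetical failure counter on the map-side TT; reportUnhealthyToJobTracker() is a placeholder, not a real Hadoop API:

          import java.util.concurrent.atomic.AtomicInteger;

          // The map-side TaskTracker counts its own failures while serving map
          // output and reports itself to the JobTracker once a threshold is
          // crossed, instead of waiting for reducers to assign blame.
          public class ServeFailureSelfReport {
              private final AtomicInteger localServeFailures = new AtomicInteger();
              private final int threshold;

              public ServeFailureSelfReport(int threshold) {
                  this.threshold = threshold;
              }

              /** Called whenever this TT fails to read/serve a local map output file. */
              public void onLocalServeFailure(Exception cause) {
                  if (localServeFailures.incrementAndGet() >= threshold) {
                      reportUnhealthyToJobTracker(cause);
                  }
              }

              private void reportUnhealthyToJobTracker(Exception cause) {
                  // Placeholder: in a real system this would go over the TT -> JT heartbeat.
                  System.err.println("Reporting self as unhealthy: " + cause);
              }
          }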


            People

            • Assignee: Unassigned
            • Reporter: jsensarma Joydeep Sen Sarma
            • Votes: 0
            • Watchers: 12
