[HADOOP-1043] Optimize the shuffle phase (increase the parallelism) - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.12.0
Component/s: None
Labels:
None

Description

In the current shuffle code, only one map output location node is accessed from any Reduce at any given point of time. For example, if a particular node, say machine1.foo.com ran 300 maps, the reducer would fetch just one output from there at a time. machine1.foo.com will be inserted into a Set datastructure (uniqueHosts) and until it gets removed from there, no other map output will be fetched from that machine. The fact that only one map output is fetched at a time from any particular host seems fine, but the logic for removing a node from uniqueHosts is such that there could be a lot of delay before a node gets deleted from the Set datastructure (even after the map output has been fetched from that node). This probably leads to suboptimal performance since it reduces the parallelism in fetching.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

1043.patch
27/Feb/07 07:25
0.7 kB
Devaraj Das

Activity

People

Assignee:: Devaraj Das

Reporter:: Devaraj Das

Votes:: 0 Vote for this issue

Watchers:: 0 Start watching this issue

Dates

Created:: 27/Feb/07 05:37

Updated:: 08/Jul/09 16:52

Resolved:: 28/Feb/07 19:59