[HADOOP-1043] Optimize the shuffle phase (increase the parallelism) - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.12.0
Component/s: None
Labels: None

Description

In the current shuffle code, a Reduce fetches at most one map output from any given host at a time. For example, if a particular node, say machine1.foo.com, ran 300 maps, the reducer fetches just one output from it at a time. machine1.foo.com is inserted into a Set data structure (uniqueHosts), and until it is removed from there, no other map output is fetched from that machine. Fetching only one map output at a time from any particular host seems fine, but the logic for removing a node from uniqueHosts is such that a node can stay in the Set for a long time even after its map output has been fetched. This probably leads to suboptimal performance, since it reduces the parallelism in fetching.
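The per-host gating described above can be illustrated with a minimal Java sketch. It is not the code from 1043.patch and not the actual Hadoop implementation: only the name uniqueHosts comes from the description, while the class ShuffleSchedulerSketch, the MapOutput type, and the methods addMapOutput, scheduleFetches, and fetchCompleted are hypothetical names chosen for illustration. The sketch assumes that the desired behavior is to release a host as soon as its fetch finishes, so the next output from that host becomes eligible on the following scheduling pass.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: only the name uniqueHosts comes from the issue text.
class ShuffleSchedulerSketch {
    // A pending map output, identified by the host that holds it.
    static final class MapOutput {
        final String host;
        final int mapId;
        MapOutput(String host, int mapId) {
            this.host = host;
            this.mapId = mapId;
        }
    }

    // Hosts that currently have a fetch in flight.
    private final Set<String> uniqueHosts = new HashSet<>();
    // Map outputs that have not been scheduled for fetching yet.
    private final List<MapOutput> pending = new ArrayList<>();

    void addMapOutput(String host, int mapId) {
        pending.add(new MapOutput(host, mapId));
    }

    // Picks at most one pending output per host that has no fetch in flight.
    List<MapOutput> scheduleFetches() {
        List<MapOutput> scheduled = new ArrayList<>();
        Iterator<MapOutput> it = pending.iterator();
        while (it.hasNext()) {
            MapOutput out = it.next();
            // Set.add returns false when the host is already in the set.
            if (uniqueHosts.add(out.host)) {
                scheduled.add(out);
                it.remove();
            }
        }
        return scheduled;
    }

    // Assumed fix: release the host immediately when its fetch completes.
    void fetchCompleted(MapOutput out) {
        uniqueHosts.remove(out.host);
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ShuffleSchedulerSketch s = new ShuffleSchedulerSketch();
        for (int i = 0; i < 3; i++) {
            s.addMapOutput("machine1.foo.com", i);
        }
        s.addMapOutput("machine2.foo.com", 3);

        List<MapOutput> first = s.scheduleFetches();
        System.out.println("pass 1 scheduled: " + first.size());
        // While machine1.foo.com stays in uniqueHosts, nothing more is scheduled.
        System.out.println("pass 2 scheduled: " + s.scheduleFetches().size());
        // Releasing the host makes its next output eligible.
        s.fetchCompleted(first.get(0));
        System.out.println("pass 3 scheduled: " + s.scheduleFetches().size());
    }
}
```

In this sketch, any delay between a fetch finishing and the call to fetchCompleted leaves every remaining output on that host unschedulable, which is the loss of parallelism the description refers to.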

Attachments

1043.patch (0.7 kB), 27/Feb/07 07:25, Devaraj Das

People

Assignee: Devaraj Das
Reporter: Devaraj Das
Votes: 0
Watchers: 0

Dates

Created: 27/Feb/07 05:37
Updated: 08/Jul/09 16:52
Resolved: 28/Feb/07 19:59
