Issue Details (XML | Word | Printable)

Key: HADOOP-2560
Type: Bug Bug
Status: Open Open
Priority: Major Major
Assignee: dhruba borthakur
Reporter: Runping Qi
Votes: 0
Watchers: 23
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Processing multiple input splits per mapper task

Created: 09/Jan/08 04:34 PM   Updated: 31/Oct/08 10:12 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works multipleSplitsPerMapper.patch 2008-10-30 05:32 PM dhruba borthakur 12 kB
Issue Links:
Blocker
 
Reference
 


 Description  « Hide
Currently, an input split contains a consecutive chunk of input file, which by default, corresponding to a DFS block.
This may lead to a large number of mapper tasks if the input data is large. This leads to the following problems:

1. Shuffling cost: since the framework has to move M * R map output segments to the nodes running reducers,
larger M means larger shuffling cost.

2. High JVM initialization overhead

3. Disk fragmentation: larger number of map output files means lower read throughput for accessing them.

Ideally, you want to keep the number of mappers to no more than 16 times the number of nodes in the cluster.
To achive that, we can increase the input split size. However, if a split span over more than one dfs block,
you lose the data locality scheduling benefits.

One way to address this problem is to combine multiple input blocks with the same rack into one split.
If in average we combine B blocks into one split, then we will reduce the number of mappers by a factor of B.
Since all the blocks for one mapper share a rack, thus we can benefit from rack-aware scheduling.

Thoughts?



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
eric baldeschwieler made changes - 11/Jan/08 10:34 PM
Field Original Value New Value
Link This issue blocks HADOOP-2014 [ HADOOP-2014 ]
eric baldeschwieler made changes - 11/Jan/08 10:34 PM
Link This issue blocks HADOOP-2014 [ HADOOP-2014 ]
eric baldeschwieler made changes - 11/Jan/08 10:35 PM
Link This issue is related to HADOOP-2014 [ HADOOP-2014 ]
Runping Qi made changes - 14/Aug/08 08:27 PM
Description
Currently, an input split contains a consecutive chunk of input file, which by default, corresponding to a DFS block.
This may lead to a large number of mapper tasks if the input data is large. This leads to the following problems:

1. Shuffling cost: since the framework has to move M * R map output segments to the nodes running reducers,
larger M means larger shuffling cost.

2. High JVM initialization overhead

3. Disk fragmentation: larger number of map output files means lower read throughput for accessing them.

Ideally, you want to keep the number of mappers to no more than 16 times the number of nodes in the cluster.
To achive that, we can increase the input split size. However, if a split span over more than one dfs block,
you lose the data locality scheduling benefits.

One way to address this problem is to combine multiple input blocks with the same rack into one split.
If in average we combine B blocks into one split, then we will reduce the number of mappers by a factor of B.
Since all the blocks for one mapper share a rack, thus we can benefit from rack-aware scheduling.

Thoughts?

Currently, an input split contains a consecutive chunk of input file, which by default, corresponding to a DFS block.
This may lead to a large number of mapper tasks if the input data is large. This leads to the following problems:

1. Shuffling cost: since the framework has to move M * R map output segments to the nodes running reducers,
larger M means larger shuffling cost.

2. High JVM initialization overhead

3. Disk fragmentation: larger number of map output files means lower read throughput for accessing them.

Ideally, you want to keep the number of mappers to no more than 16 times the number of nodes in the cluster.
To achive that, we can increase the input split size. However, if a split span over more than one dfs block,
you lose the data locality scheduling benefits.

One way to address this problem is to combine multiple input blocks with the same rack into one split.
If in average we combine B blocks into one split, then we will reduce the number of mappers by a factor of B.
Since all the blocks for one mapper share a rack, thus we can benefit from rack-aware scheduling.

Thoughts?

Summary Combining multiple input blocks into one mapper Processing multiple input splits per mapper task
Runping Qi made changes - 14/Aug/08 08:52 PM
Link This issue is related to HADOOP-249 [ HADOOP-249 ]
Runping Qi made changes - 14/Aug/08 08:54 PM
Link This issue is blocked by HADOOP-249 [ HADOOP-249 ]
Runping Qi made changes - 30/Oct/08 05:04 PM
Link This issue is related to HADOOP-3293 [ HADOOP-3293 ]
dhruba borthakur made changes - 30/Oct/08 05:32 PM
Attachment multipleSplitsPerMapper.patch [ 12393082 ]
dhruba borthakur made changes - 30/Oct/08 05:33 PM
Assignee dhruba borthakur [ dhruba ]