Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 0.20.1
- Component/s: None
- Labels: None
Description
HADOOP-3638 improved the merge of the map output by caching the index files. But why not also cache the data files?
In our use-case scenario (still on hadoop-0.18.3, so HADOOP-3638 would only help partially), an individual map task finishes in less than 30 minutes, but needs 4 hours to merge 70 spills for 20,000 partitions (with LZO compression), reading about 10 kB from each spill file, which is re-opened for every partition. Since this is just a merge sort, there is no reason not to keep the input files open and read them sequentially, eliminating the seeks altogether.