[HADOOP-331] map outputs should be written to a single output file with an index - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.3.2
Fix Version/s: 0.10.0
Component/s: None
Labels: None

Description

The current strategy of having each map write a separate file per target reduce wastes a lot of buffer space (causing out-of-memory crashes) and puts a heavy burden on the FS (many opens, many inodes used, etc.).

I propose that we write a single file containing all output, plus an index file identifying which byte range in that file goes to each reduce. This will eliminate the buffer waste, address scaling issues with the number of open files, and generally set us up better for scaling. It will also help with very small inputs, since the buffer cache will reduce the number of seeks needed, and the data-serving node can open a single file and keep it open rather than doing directory and open operations on every request.
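A minimal sketch of the proposed layout, not the code from the attached patch: all per-reduce segments are written back to back into one data file, and an index file records the start offset and length of each segment. The class name `SingleFileMapOutput`, the method names, and the fixed two-longs-per-entry index format are assumptions made for illustration only.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration; names and index format are not taken from Hadoop.
public class SingleFileMapOutput {
    // Each index entry holds two longs: start offset and length of one reduce's segment.
    private static final int ENTRY_BYTES = 2 * Long.BYTES;

    /** Writes all segments into dataFile and records their byte ranges in indexFile. */
    public static void write(Path dataFile, Path indexFile, List<byte[]> segments)
            throws IOException {
        try (OutputStream data = Files.newOutputStream(dataFile);
             DataOutputStream index = new DataOutputStream(Files.newOutputStream(indexFile))) {
            long offset = 0;
            for (byte[] segment : segments) {
                data.write(segment);
                index.writeLong(offset);
                index.writeLong(segment.length);
                offset += segment.length;
            }
        }
    }

    /** Returns the bytes destined for one reduce, located through the index. */
    public static byte[] read(Path dataFile, Path indexFile, int reduce) throws IOException {
        long offset;
        long length;
        try (RandomAccessFile index = new RandomAccessFile(indexFile.toFile(), "r")) {
            // Fixed-size entries let us jump straight to the entry for this reduce.
            index.seek((long) reduce * ENTRY_BYTES);
            offset = index.readLong();
            length = index.readLong();
        }
        byte[] segment = new byte[(int) length];
        try (RandomAccessFile data = new RandomAccessFile(dataFile.toFile(), "r")) {
            data.seek(offset);
            data.readFully(segment);
        }
        return segment;
    }
}
```

With this layout a map produces two files regardless of the number of reduces, and serving a reduce's request costs one index lookup plus one seek into the data file.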

The only issue I see is that when the task output is substantially larger than its input, we may need to spill multiple times. In that case, we can do a merge after all spills are complete (or during the final spill).
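The merge of multiple spills could look roughly like the sketch below. It assumes each spill holds one segment per reduce and that every segment is already sorted by key, which the description does not state (the linked HADOOP-717 only suggests map-side sorting). The class `SpillMerge`, the in-memory lists, and the use of string keys are hypothetical simplifications; a real implementation would stream from the spill files.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a per-reduce k-way merge across spills.
public class SpillMerge {
    /** Tracks the current head key of one spill segment and the keys that remain. */
    private static final class Cursor implements Comparable<Cursor> {
        private String head;
        private final Iterator<String> rest;

        Cursor(Iterator<String> nonEmpty) {
            this.rest = nonEmpty;
            this.head = nonEmpty.next();
        }

        /** Moves to the next key; returns false when the segment is exhausted. */
        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            head = rest.next();
            return true;
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    /**
     * spills.get(s).get(r) is the sorted key list that spill s holds for reduce r.
     * Returns one merged, sorted list per reduce.
     */
    public static List<List<String>> merge(List<List<List<String>>> spills, int numReduces) {
        List<List<String>> merged = new ArrayList<>();
        for (int r = 0; r < numReduces; r++) {
            PriorityQueue<Cursor> heap = new PriorityQueue<>();
            for (List<List<String>> spill : spills) {
                Iterator<String> it = spill.get(r).iterator();
                if (it.hasNext()) {
                    heap.add(new Cursor(it));
                }
            }
            List<String> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                // Emit the smallest head, then re-insert its cursor if keys remain.
                Cursor smallest = heap.poll();
                out.add(smallest.head);
                if (smallest.advance()) {
                    heap.add(smallest);
                }
            }
            merged.add(out);
        }
        return merged;
    }
}
```

Merging one reduce at a time keeps the final output in the same single-file, one-range-per-reduce shape as a map that never spilled.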

Attachments

- 331.patch, 07/Dec/06 12:18, 60 kB, Devaraj Das
- 331.txt, 20/Oct/06 00:54, 4 kB, Devaraj Das
- 331-design.txt, 31/Oct/06 18:18, 4 kB, Devaraj Das
- 331-initial3.patch, 30/Nov/06 16:32, 67 kB, Devaraj Das

Issue Links

- incorporates: HADOOP-717 When there are few reducers, sorting should be done by mappers (Closed)
- is cloned by: HADOOP-570 Map tasks may fail due to out of memory, if the number of reducers are moderately big (Closed)

People

Assignee: Devaraj Das

Reporter: Eric Baldeschwieler

Votes: 2

Watchers: 3

Dates

Created: 29/Jun/06 03:50

Updated: 08/Jul/09 16:51

Resolved: 08/Dec/06 01:58