Hadoop Common / HADOOP-1609

Optimize MapTask.MapOutputBuffer.spill() by using appendRaw instead of deserializing/serializing keys and values


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.14.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      In MapTask.MapOutputBuffer.spill(), every key and value is deserialized from the raw buffer and then re-serialized to the output file by append(key, value):

            DataInputBuffer keyIn = new DataInputBuffer();
            DataInputBuffer valIn = new DataInputBuffer();
            DataOutputBuffer valOut = new DataOutputBuffer();
            while (resultIter.next()) {
              // Deserialize the key from its raw bytes in the buffer.
              keyIn.reset(resultIter.getKey().getData(),
                          resultIter.getKey().getLength());
              key.readFields(keyIn);
              // Copy the raw value into valOut, then deserialize from there.
              valOut.reset();
              (resultIter.getValue()).writeUncompressedBytes(valOut);
              valIn.reset(valOut.getData(), valOut.getLength());
              value.readFields(valIn);
              // append() immediately re-serializes both objects to the file.
              writer.append(key, value);
              reporter.progress();
            }
      

      When the keys or values are complex objects, like Nutch's ParseData or Inlinks, this deserialize/re-serialize round trip takes time and creates lots of garbage.

      I've created a patch that seems to work, though I've only tested it on 0.13.0.
      It's a bit clumsy, since ValueBytes is cast to UncompressedBytes/CompressedBytes inside SequenceFile.Writer.
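
      For illustration, here is a minimal sketch of the idea (a sketch only,
      not the attached patch): the loop above collapses to raw appends via
      SequenceFile.Writer.appendRaw(byte[], int, int, ValueBytes), so the raw
      key bytes and the ValueBytes reach the writer without being turned into
      Writable objects in between:

            // Sketch only: hand the raw key bytes and the ValueBytes straight
            // to the writer, skipping readFields() and append() entirely.
            while (resultIter.next()) {
              writer.appendRaw(resultIter.getKey().getData(), 0,
                               resultIter.getKey().getLength(),
                               resultIter.getValue());
              reporter.progress();
            }

      The clumsiness mentioned above surfaces here: inside appendRaw() the
      ValueBytes is downcast to the writer's own UncompressedBytes or
      CompressedBytes, so the iterator's value representation has to match the
      writer's compression settings.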

      Thoughts?

      Attachments

        1. spill.patch (6 kB) by Espen Amble Kolstad
        2. spill.patch (6 kB) by Espen Amble Kolstad


      People

        Assignee: Unassigned
        Reporter: Espen Amble Kolstad (kolstae)
        Votes: 0
        Watchers: 1
