Hadoop Common / HADOOP-1609

Optimize MapTask.MapOutputBuffer.spill() by using appendRaw instead of deserializing/serializing keys and values


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.14.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      In MapTask.MapOutputBuffer.spill(), every key and value is deserialized from the raw buffer and then re-serialized to the output file by append(key, value):

            DataInputBuffer keyIn = new DataInputBuffer();
            DataInputBuffer valIn = new DataInputBuffer();
            DataOutputBuffer valOut = new DataOutputBuffer();
            while (resultIter.next()) {
              // Deserialize the key from its raw bytes in the buffer.
              keyIn.reset(resultIter.getKey().getData(),
                          resultIter.getKey().getLength());
              key.readFields(keyIn);
              // Copy the raw value into valOut, then deserialize from there.
              valOut.reset();
              (resultIter.getValue()).writeUncompressedBytes(valOut);
              valIn.reset(valOut.getData(), valOut.getLength());
              value.readFields(valIn);
              // append() immediately re-serializes both objects to the file.
              writer.append(key, value);
              reporter.progress();
            }
      

      When the keys or values are complex objects, like Nutch's ParseData or Inlinks, this deserialize/re-serialize round trip takes time and creates lots of garbage.

      I've created a patch that seems to work, though I've only tested it on 0.13.0.
      It's a bit clumsy, since ValueBytes is cast to UncompressedBytes/CompressedBytes inside SequenceFile.Writer.
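
      For illustration, here is a minimal sketch of the idea (a sketch only,
      not the attached patch): the loop above collapses to raw appends via
      SequenceFile.Writer.appendRaw(byte[], int, int, ValueBytes), so the raw
      key bytes and the ValueBytes reach the writer without being turned into
      Writable objects in between:

            // Sketch only: hand the raw key bytes and the ValueBytes straight
            // to the writer, skipping readFields() and append() entirely.
            while (resultIter.next()) {
              writer.appendRaw(resultIter.getKey().getData(), 0,
                               resultIter.getKey().getLength(),
                               resultIter.getValue());
              reporter.progress();
            }

      The clumsiness mentioned above surfaces here: inside appendRaw() the
      ValueBytes is downcast to the writer's own UncompressedBytes or
      CompressedBytes, so the iterator's value representation has to match the
      writer's compression settings.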

      Thoughts?

      Attachments

        1. spill.patch (6 kB) by Espen Amble Kolstad
        2. spill.patch (6 kB) by Espen Amble Kolstad


      People

        Assignee: Unassigned
        Reporter: Espen Amble Kolstad (kolstae)
        Votes: 0
        Watchers: 1
