  1. Hadoop Map/Reduce
  2. MAPREDUCE-573

reduce scans/copies while reading data in hadoop streaming


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: contrib/streaming

    Description

      Follow-up from: http://issues.apache.org/jira/browse/HADOOP-2826

      We copy an entire line (from readLine) and then break it into two strings by splitting on the tab. This costs an extra scan of the input data and an extra copy for the split.
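      To make the extra work concrete, here is a minimal sketch of the split-after-read pattern described above. It is not the actual streaming code; the class and method names (SplitAfterRead, splitOnTab) are hypothetical, and it assumes the line has already been read into a byte array by an earlier pass that located the newline.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not taken from Hadoop streaming. */
final class SplitAfterRead {
    /** Splits an already-read line at its first tab and returns {key, value}. */
    static byte[][] splitOnTab(byte[] line) {
        int tab = -1;
        // Second scan: these bytes were already scanned once to find the end of the line.
        for (int i = 0; i < line.length; i++) {
            if (line[i] == '\t') {
                tab = i;
                break;
            }
        }
        if (tab < 0) {
            // No tab: the whole line is the key and the value is empty.
            return new byte[][] { line, new byte[0] };
        }
        // Extra copy: both halves are copied out of the line buffer.
        byte[] key = Arrays.copyOfRange(line, 0, tab);
        byte[] value = Arrays.copyOfRange(line, tab + 1, line.length);
        return new byte[][] { key, value };
    }
}
```

      Every byte up to the tab is examined twice (once to find the newline, once to find the tab), and the key and value bytes are copied a second time after the line itself was copied.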

      Instead, if we generalized LineReader to read until it hits a configurable delimiter, we could do this with one less scan and one less copy. Something like:

      byte[] tabDelimiter = new byte[] { '\t' };
      byte[] newlineDelimiter = new byte[] { '\n', '\r' };

      while (/* more input */) {
        lineReader.setDelimiter(tabDelimiter);
        lineReader.readLine(key);      // read up to the tab
        lineReader.setDelimiter(newlineDelimiter);
        lineReader.readLine(value);    // read the rest of the line
      }

      (Take my proposed interfaces with a pinch of salt; they are only meant to convey the idea.)
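      The idea can be tried out in isolation with a small self-contained sketch. Everything here is an assumption for illustration: DelimitedReader, readUntil and readRecords are hypothetical names, not Hadoop's LineReader API, and the delimiter array is treated as a set of single bytes. Unlike the snippet above, the key is read until a tab or a newline so that a line without a tab does not run into the next record, and only '\n' is treated as the line terminator.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of delimiter-driven reading; not Hadoop's LineReader. */
final class DelimitedReader {
    private final InputStream in;

    DelimitedReader(InputStream in) {
        this.in = in;
    }

    /**
     * Appends bytes to out until one of the delimiter bytes or end of stream.
     * The delimiter is consumed but not stored.
     * Returns the delimiter byte that ended the read, or -1 at end of stream.
     */
    int readUntil(byte[] delimiters, ByteArrayOutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            for (byte d : delimiters) {
                if ((byte) b == d) {
                    return b;
                }
            }
            out.write(b);
        }
        return -1;
    }

    /** Parses tab-separated key/value lines, looking at each input byte once. */
    static List<String[]> readRecords(byte[] input) {
        DelimitedReader reader = new DelimitedReader(new ByteArrayInputStream(input));
        byte[] keyDelimiters = { '\t', '\n' };
        byte[] valueDelimiters = { '\n' };
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        List<String[]> records = new ArrayList<>();
        try {
            while (true) {
                key.reset();
                value.reset();
                int hit = reader.readUntil(keyDelimiters, key);
                if (hit == -1 && key.size() == 0) {
                    break; // end of input, nothing left to emit
                }
                if (hit == '\t') {
                    // Only read a value when the key was ended by a tab.
                    reader.readUntil(valueDelimiters, value);
                }
                records.add(new String[] {
                    new String(key.toByteArray(), StandardCharsets.UTF_8),
                    new String(value.toByteArray(), StandardCharsets.UTF_8)
                });
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

      For example, DelimitedReader.readRecords("k1\tv1\nk2\n".getBytes(StandardCharsets.UTF_8)) yields the pairs ("k1", "v1") and ("k2", ""). The sketch reads one byte at a time for brevity; a real reader would presumably scan its internal buffer in place, and "\r\n" line endings would need handling that is omitted here.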

          People

            Assignee: Unassigned
            Reporter: Joydeep Sen Sarma (jsensarma)