[MAPREDUCE-573] reduce scans/copies while reading data in hadoop streaming - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: contrib/streaming
Labels:
- newbie

Description

Follow-up from: http://issues.apache.org/jira/browse/HADOOP-2826

Streaming currently copies an entire line (from readLine) and then breaks it into two strings by splitting on the first tab. That costs an extra scan of the input data to find the tab, plus an extra copy to build the key and value from the line.

If LineReader were generalized to read until it hits a configurable delimiter, the key and value could each be read directly, with one less scan and one less copy. Something like:

    byte[] tabDelimiter = new byte[] { '\t' };
    byte[] newlineDelimiter = new byte[] { '\n', '\r' };

    while (/* more input */) {
        // read up to the tab straight into the key
        lineReader.setDelimiter(tabDelimiter);
        lineReader.readLine(key);
        // read the rest of the line straight into the value
        lineReader.setDelimiter(newlineDelimiter);
        lineReader.readLine(value);
    }

(Take my proposed interfaces with a pinch of salt; they are only meant to convey the idea.)
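To make the idea concrete, here is a minimal, self-contained sketch of a single-pass key/value reader. It is not Hadoop's LineReader and does not use the setDelimiter interface proposed above: the class DelimitedReader and its methods readUntil and next are hypothetical names invented for illustration. It also assumes that records end with '\n' only (no '\r' handling) and that a line without a tab is all key with an empty value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a delimiter-aware reader; not part of Hadoop. */
public class DelimitedReader {
    private final InputStream in;
    private final byte[] buf = new byte[4096];
    private int pos = 0;   // next unread byte in buf
    private int len = 0;   // number of valid bytes in buf
    private boolean hitLineEnd = false; // last read stopped at '\n' or EOF

    public DelimitedReader(InputStream in) {
        this.in = in;
    }

    /**
     * Appends bytes to out until the stop byte or '\n' is seen, consuming
     * that terminator. Returns false only if EOF is hit before any data.
     */
    private boolean readUntil(byte stop, ByteArrayOutputStream out) throws IOException {
        boolean readAny = false;
        hitLineEnd = false;
        while (true) {
            if (pos == len) {               // buffer exhausted: refill
                len = in.read(buf, 0, buf.length);
                pos = 0;
                if (len <= 0) {             // EOF
                    len = 0;
                    hitLineEnd = true;
                    return readAny;
                }
            }
            readAny = true;
            int start = pos;
            // Each input byte is examined exactly once, here.
            while (pos < len && buf[pos] != stop && buf[pos] != '\n') {
                pos++;
            }
            out.write(buf, start, pos - start); // the only copy of these bytes
            if (pos < len) {                // stopped on a terminator
                hitLineEnd = (buf[pos] == '\n');
                pos++;                      // consume the terminator
                return true;
            }
        }
    }

    /** Reads one record: key up to the first tab, value up to end of line. */
    public boolean next(ByteArrayOutputStream key, ByteArrayOutputStream value) throws IOException {
        key.reset();
        value.reset();
        if (!readUntil((byte) '\t', key)) {
            return false;                   // no more records
        }
        if (!hitLineEnd) {                  // a tab was found: the rest is the value
            readUntil((byte) '\n', value);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "k1\tv1\nk2\tv2\n".getBytes(StandardCharsets.UTF_8);
        DelimitedReader reader = new DelimitedReader(new ByteArrayInputStream(data));
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        while (reader.next(key, value)) {
            System.out.println(new String(key.toByteArray(), StandardCharsets.UTF_8)
                    + " => " + new String(value.toByteArray(), StandardCharsets.UTF_8));
        }
    }
}
```

Compared with reading a whole line and then splitting it, this sketch never materializes the intermediate line: bytes go from the read buffer directly into the key or the value. Tabs after the first one are left in the value, which matches splitting on the first tab only.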

People

Assignee: Unassigned

Reporter: Joydeep Sen Sarma

Votes: 0

Watchers: 0

Dates

Created: 15/Apr/08 00:00

Updated: 17/Jul/14 21:57