
Key: MAPREDUCE-772
Type: Improvement
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Abdul Qadeer
Reporter: Abdul Qadeer
Votes: 0
Watchers: 8
Hadoop Map/Reduce

Changing LineRecordReader algo so that it does not need to skip backwards in the stream

Created: 23/Aug/08 12:39 AM   Updated: 21/Jul/09 05:17 AM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.21.0

Time Tracking:
Not Specified

File Attachments (text files, licensed for inclusion in ASF works):
4010-mapreduce.patch, 2009-07-19 09:14 PM, Chris Douglas, 2 kB
Hadoop-4010.patch, 2008-08-23 12:45 AM, Abdul Qadeer, 2 kB
Hadoop-4010_version2.patch, 2008-08-31 06:04 AM, Abdul Qadeer, 2 kB
Hadoop-4010_version3.patch, 2008-09-16 07:58 AM, Abdul Qadeer, 4 kB
Issue Links:
Reference
 
dependent
 

Hadoop Flags: Incompatible change, Reviewed
Resolution Date: 21/Jul/09 05:17 AM


Description
The current LineRecordReader algorithm has to move backwards in the stream (in its constructor) to position itself correctly. It moves back one byte from the start of its split, reads one record (i.e. a line), and throws it away, because that line is guaranteed to be handled by another mapper. This approach is difficult and inefficient for compressed streams, where data reaches the LineRecordReader through a codec. (In the current implementation, Hadoop does not split a compressed file; it makes a single split from the start to the end of the file, so only one mapper handles it. We are currently working on a BZip2 codec that makes splitting possible in Hadoop. The proposed change will make it possible to handle plain and compressed streams uniformly.)

In the new algorithm, each mapper (other than the one whose split starts at the beginning of the file) always skips its first line, because that line will already have been read by another mapper. Consequently, each mapper must finish reading at a record boundary that always lies beyond its upper split limit. With this change, LineRecordReader no longer needs to move backwards in the stream.
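
The forward-only scheme described above can be illustrated with a minimal sketch. This is not the code from the attached patches: the class SplitLineSketch and the methods readSplit and nextLineStart are hypothetical names chosen for illustration, the input is a plain in-memory byte array, and lines are assumed to end with '\n'. The sketch also assumes that a split starting at offset 0 keeps its first line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the LineRecordReader implementation.
class SplitLineSketch {

    /** Returns the lines owned by the split [start, end) of data. */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // Assumption: only the split at offset 0 keeps its first line.
        // Every other split discards it by reading forward, never backward.
        if (start != 0) {
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // "<=" rather than "<": a line that begins exactly at 'end' is read
        // here, because the following split will skip it as its first line.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            // Exclude the trailing '\n' (if any) from the returned text.
            int stop = (next > pos && data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, stop - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Offset just past the next '\n' at or after 'from', or data.length. */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        // Three 3-byte splits over a 9-byte input.
        for (int start = 0; start < data.length; start += 3) {
            System.out.println(readSplit(data, start, start + 3));
        }
    }
}
```

With the 9-byte input used in main and three 3-byte splits, the first split yields "a" and "bb", the second yields "ccc", and the third yields nothing, so every line is read exactly once and no split ever seeks before its own start offset.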


