[MAPREDUCE-772] Changing LineRecordReader algo so that it does not need to skip backwards in the stream - ASF JIRA


Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.21.0
Component/s: None
Labels: None
Hadoop Flags: Incompatible change, Reviewed

Description

The current LineRecordReader algorithm needs to move backwards in the stream (in its constructor) to position itself correctly. It moves back one byte from the start of its split, reads a record (i.e. a line), and throws it away, because it is sure that this line will be handled by some other mapper. This approach is difficult and inefficient for compressed streams, where data reaches the LineRecordReader through a codec. (In the current implementation, Hadoop does not split a compressed file: it makes a single split from the start to the end of the file, so only one mapper handles it. We are currently working on making the BZip2 codec splittable in Hadoop, so the proposed change will make it possible to handle plain and compressed streams uniformly.)

In the new algorithm, each mapper always skips its first line, because it is sure that this line will have been read by some other mapper. Consequently, each mapper must now finish reading at a record boundary that is always beyond its upper split limit. With this change, LineRecordReader no longer needs to move backwards in the stream.
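
The forward-only scheme described above can be illustrated with a small in-memory sketch. This is not the code from the attached patches: the class name `SplitLineReaderSketch` and the methods `endOfLine` and `readSplit` are hypothetical, the input is a plain byte array rather than a Hadoop stream, and it assumes that the split starting at offset 0 keeps its first line, since no earlier mapper exists to read it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the attached patches.
final class SplitLineReaderSketch {

    // Returns the index just past the line that contains pos
    // (past its '\n', or data.length if the last line is unterminated).
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Reads the lines owned by the split [start, end), moving forward only.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Assumption: every split except the first discards its first
            // (possibly partial) line; the previous split reads it.
            pos = endOfLine(data, pos);
        }
        // "<=" so that a line beginning exactly at 'end' is read here;
        // the next split discards it as its first line.
        while (pos <= end && pos < data.length) {
            int next = endOfLine(data, pos);
            int textEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = next; // the last line read may end beyond 'end'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int[][] splits = {{0, 8}, {8, 17}, {17, data.length}};
        for (int[] s : splits) {
            System.out.println("[" + s[0] + ", " + s[1] + ") -> " + readSplit(data, s[0], s[1]));
        }
    }
}
```

In this sketch the read position never decreases, and a reader finishes at a line boundary at or past its upper limit, so every line is consumed by exactly one split wherever the split boundaries fall.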

Attachments


Hadoop-4010.patch | 23/Aug/08 00:45 | 2 kB | Abdul Qadeer
Hadoop-4010_version3.patch | 16/Sep/08 07:58 | 4 kB | Abdul Qadeer
Hadoop-4010_version2.patch | 31/Aug/08 06:04 | 2 kB | Abdul Qadeer
4010-mapreduce.patch | 19/Jul/09 21:14 | 2 kB | Christopher Douglas

Issue Links

is depended upon by

HADOOP-4012 Providing splitting support for bzip2 compressed files (Closed)

relates to

HADOOP-3646 Providing bzip2 as codec (Closed)

HADOOP-4182 Streaming Documentation Update (Closed)

People

Assignee: Abdul Qadeer
Reporter: Abdul Qadeer
Votes: 0
Watchers: 7

Dates

Created: 23/Aug/08 00:39
Updated: 24/Aug/10 21:14
Resolved: 21/Jul/09 05:17