[MAPREDUCE-5948] org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.20.2, 0.23.9, 2.2.0
Fix Version/s: 2.8.0, 2.7.2, 2.6.3, 3.0.0-alpha1
Component/s: None
Labels: None
Environment: CDH3U2, Red Hat Linux 5.7
Target Version/s: 2.8.0
Hadoop Flags: Reviewed

Description

Defining a multibyte record delimiter in a new InputFileFormat sometimes causes records from the input to be skipped.

This happens when an input split boundary falls just after a record separator. The starting point of the next split is then nonzero, so skipFirstLine is true. The reader seeks to start - 1 and ignores the text up to the first record delimiter, on the presumption that this record was already handled by the previous map task. Because the record delimiter is multibyte, the seek brings only the last byte of the delimiter into scope, and that byte is not recognized as a full delimiter. The text is therefore skipped until the next delimiter, which drops a full record.
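
The scenario above can be reproduced with a small toy model. The sketch below is not Hadoop code: the class and method names (`DelimiterSkipDemo`, `readSplit`, `readAllSplits`, `skipPastNextDelimiter`) are hypothetical, the input is assumed to be ASCII so that string indices stand in for byte offsets, and the read loop is a simplification of the behavior described in this issue (back up one byte, discard through the first delimiter found, then read records while the position is before the split end).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the split-reading behavior described in this issue; not Hadoop code. */
final class DelimiterSkipDemo {

    /** Returns the offset just past the first delimiter found at or after {@code from}. */
    static int skipPastNextDelimiter(String data, String delim, int from) {
        int idx = data.indexOf(delim, from);
        return idx < 0 ? data.length() : idx + delim.length();
    }

    /** Returns the records that a reader assigned to the split [start, end) would emit. */
    static List<String> readSplit(String data, String delim, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard text through the first delimiter,
            // presuming the previous split's reader already handled that record.
            pos = skipPastNextDelimiter(data, delim, start - 1);
        }
        // A record that begins before `end` is read in full, even past `end`.
        while (pos < end && pos < data.length()) {
            int idx = data.indexOf(delim, pos);
            int recordEnd = idx < 0 ? data.length() : idx;
            records.add(data.substring(pos, recordEnd));
            pos = idx < 0 ? data.length() : idx + delim.length();
        }
        return records;
    }

    /** Reads the input as two splits, [0, boundary) and [boundary, length). */
    static List<String> readAllSplits(String data, String delim, int boundary) {
        List<String> all = new ArrayList<>(readSplit(data, delim, 0, boundary));
        all.addAll(readSplit(data, delim, boundary, data.length()));
        return all;
    }

    public static void main(String[] args) {
        // Single-byte delimiter, boundary right after it: nothing is lost.
        System.out.println(readAllSplits("aaa|bbb|ccc|", "|", 4));     // [aaa, bbb, ccc]
        // Two-byte delimiter, boundary right after it: "bbb" is dropped.
        System.out.println(readAllSplits("aaa||bbb||ccc||", "||", 5)); // [aaa, ccc]
    }
}
```

In the two-byte case the second reader starts its search at offset 4, where it sees only the trailing `|` of the first delimiter followed by `b`. The first complete `||` it finds is the one after `bbb`, so that record is discarded as if it belonged to the previous split.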

Attachments

HADOOP-9867.patch, 25/Jun/14 13:57, 19 kB, Rushabh Shah
HADOOP-9867.patch, 10/Dec/13 03:12, 14 kB, Vinayakumar B
HADOOP-9867.patch, 20/Nov/13 09:58, 10 kB, Vinayakumar B
HADOOP-9867.patch, 20/Nov/13 09:49, 9 kB, Vinayakumar B
MAPREDUCE-5948.002.patch, 09/Jun/15 23:49, 20 kB, Akira Ajisaka
MAPREDUCE-5948.003.patch, 10/Jun/15 16:32, 20 kB, Akira Ajisaka

Issue Links

is related to

HADOOP-13192 org.apache.hadoop.util.LineReader cannot handle multibyte delimiters correctly (Closed)
MAPREDUCE-5656 bzip2 codec can drop records when reading data in splits (Closed)
MAPREDUCE-6481 LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes (Closed)

relates to

HADOOP-11445 Bzip2Codec: Data block is skipped when position of newly created stream is equal to start of split (Closed)

People

Assignee: Akira Ajisaka
Reporter: Kris Geusebroek
Votes: 1
Watchers: 11

Dates

Created: 13/Aug/13 12:33
Updated: 06/Jan/17 00:46
Resolved: 22/Jun/15 22:01