[MAPREDUCE-6549] multibyte delimiters with LineRecordReader cause duplicate records - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7.2
Fix Version/s: 2.8.0, 2.7.2, 2.6.3, 3.0.0-alpha1
Component/s: mrv1, mrv2
Labels: None
Hadoop Flags: Reviewed

Description

LineRecordReader currently produces duplicate records in certain scenarios when a multibyte record delimiter is used, for example:

1) input string: "abc++defghi+"
delimiter string: "+++"
test passes with all sizes of the split
2) input string: "abc+def+ghi+"
delimiter string: "+++"
test fails with a split size of 4
3) input string: "abc++defghi+"
delimiter string: "++"
test fails with a split size of 5
4) input string: "abc++defghij+"
delimiter string: "++"
test fails with a split size of 4
5) input string: "abc+def+ghi+"
delimiter string: "++"
test fails with a split size of 9
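
To make the expected behaviour in these scenarios concrete, here is a minimal, self-contained sketch in Python. It is not Hadoop's LineRecordReader code and does not reproduce the bug. It only models what a correct reader should return: the same records, each exactly once, whatever the split size. All function names are hypothetical. The ownership rule (a record belongs to the split with start < offset <= end, and the first split also owns offset 0) is a simplifying assumption based on the usual convention that a split skips its first partial record and reads one record past its end.

```python
def expected_records(data, delimiter):
    """Scan left to right and return (start_offset, record) pairs.

    A trailing delimiter does not produce an extra empty record.
    """
    records = []
    pos = 0
    while pos < len(data):
        hit = data.find(delimiter, pos)
        if hit == -1:
            # No further delimiter: the rest of the input is one record.
            records.append((pos, data[pos:]))
            break
        records.append((pos, data[pos:hit]))
        pos = hit + len(delimiter)
    return records


def split_ranges(length, split_size):
    """Return contiguous [start, end) byte ranges covering the input."""
    return [(s, min(s + split_size, length))
            for s in range(0, length, split_size)]


def records_for_split(records, start, end):
    """Assumed ownership rule: each record belongs to exactly one split."""
    return [text for offset, text in records
            if (offset == 0 and start == 0) or start < offset <= end]


def read_all_splits(data, delimiter, split_size):
    """Concatenate what every split would return under the assumed rule."""
    records = expected_records(data, delimiter)
    result = []
    for start, end in split_ranges(len(data), split_size):
        result.extend(records_for_split(records, start, end))
    return result


# (input string, delimiter string, split size) taken from the report above.
SCENARIOS = [
    ("abc+def+ghi+", "+++", 4),
    ("abc++defghi+", "++", 5),
    ("abc++defghij+", "++", 4),
    ("abc+def+ghi+", "++", 9),
]

for data, delimiter, split_size in SCENARIOS:
    whole_file = [text for _, text in expected_records(data, delimiter)]
    # A correct reader returns the whole-file records, with no duplicates.
    assert read_all_splits(data, delimiter, split_size) == whole_file
```

A regression test against the real reader could follow the same shape: read the input once as a single split, read it again with each split size, and compare the two record lists.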

Attachments

- MAPREDUCE-6549-2.patch (30 kB), uploaded 15/Nov/15 23:02 by wilfreds#1
- MAPREDUCE-6549-1.patch (3 kB), uploaded 14/Nov/15 21:39 by Dustin Cote
- MAPREDUCE-6549.3.patch (31 kB), uploaded 25/Nov/15 00:38 by wilfreds#1

Issue Links

is duplicated by

MAPREDUCE-6891 TextInputFormat: duplicate records with custom delimiter (Resolved)

relates to

MAPREDUCE-6481 LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes. (Closed)

MAPREDUCE-6558 multibyte delimiters with compressed input files generate duplicate records (Closed)

People

Assignee: Wilfred Spiegelenburg

Reporter: Dustin Cote

Votes: 0

Watchers: 10

Dates

Created: 14/Nov/15 21:32

Updated: 26/Feb/20 05:50

Resolved: 26/Nov/15 01:06