[MAPREDUCE-6558] multibyte delimiters with compressed input files generate duplicate records - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7.2
Fix Version/s: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
Component/s: mrv1, mrv2
Labels: None
Target Version/s: 2.8.0, 2.7.3, 2.6.5
Hadoop Flags: Reviewed

Description

This is the follow-up to ~~MAPREDUCE-6549~~. Compressed input files cause duplicate records, as shown by several JUnit tests. The number of duplicated records changes with the split size:

Unexpected number of records in split (splitsize = 10)
Expected: 41051
Actual: 45062

Unexpected number of records in split (splitsize = 100000)
Expected: 41051
Actual: 41052

The test passes with splitsize = 147445, which is the compressed file length. The file is a bzip2 file with 100k blocks, 11 blocks in total.
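
For reference, the expected count in such a test is what a single reader sees when it scans the whole decompressed stream once; the per-split counts should add up to that same number regardless of split size. A minimal Python sketch of that baseline, assuming records are separated by a multibyte delimiter such as `|+|` (the delimiter and the function names `count_records` and `count_records_bz2` are illustrative and not taken from the Hadoop test code):

```python
import bz2


def count_records(data: bytes, delimiter: bytes) -> int:
    """Count delimiter-separated records in an uncompressed byte string.

    A trailing delimiter terminates the last record; it does not start
    an extra empty one. Empty input contains no records.
    """
    if not data:
        return 0
    parts = data.split(delimiter)
    if parts[-1] == b"":
        # The data ended with the delimiter, so drop the empty tail.
        parts.pop()
    return len(parts)


def count_records_bz2(compressed: bytes, delimiter: bytes) -> int:
    """Decompress a bzip2 payload in memory and count its records."""
    return count_records(bz2.decompress(compressed), delimiter)


if __name__ == "__main__":
    payload = b"first|+|second|+|third|+|"
    print(count_records_bz2(bz2.compress(payload), b"|+|"))
```

This only establishes the reference count. It does not model how splits are assigned, which is where the duplication described above arises.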

Attachments

- MAPREDUCE-6558.1.patch, 10/May/16 16:04, 322 kB, wilfreds#1
- MAPREDUCE-6558.2.patch, 13/May/16 01:14, 16 kB, wilfreds#1
- MAPREDUCE-6558.3.patch, 13/May/16 05:14, 7 kB, wilfreds#1

Issue Links

is related to: MAPREDUCE-6549 multibyte delimiters with LineRecordReader cause duplicate records (Closed)

People

Assignee: Wilfred Spiegelenburg

Reporter: Wilfred Spiegelenburg

Votes: 0

Watchers: 6

Dates

Created: 24/Nov/15 10:41

Updated: 26/Feb/20 05:51

Resolved: 13/May/16 14:40