[HADOOP-11445] Bzip2Codec: Data block is skipped when position of newly created stream is equal to start of split - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.7.0
Component/s: None
Labels: None
Target Version/s: 2.7.0
Hadoop Flags: Reviewed

Description

bz2 input files are handled by FileInputFormat and LineRecordReader. LineRecordReader creates a bz2-specific compressed input stream to iterate over records. A newly created stream always points to the beginning of the next data block, and the logic that finds that block depends on the start of the split: the search begins 10 bytes before the split start. If this first search yields a stream whose position is before or at the split start, the beginning of the following block is sought instead, on the assumption that the record reader for the previous split has already iterated over the data block in which the current split start lies. When the split start falls exactly on the byte where the newly created stream is positioned (the start of a data block), the code therefore still tries to find the beginning of the next data block. This does not seem correct: it skips a whole block and causes records to be missed.
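The boundary case can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Hadoop code: block starts are plain offsets in a sorted array, and the names `SplitStartSketch`, `firstBlockFor`, `nextBlockAtOrAfter`, and `SEARCH_BACKOFF` are hypothetical. The `skipWhenEqual` flag switches between the comparison described above ("before or at" the split start) and a strict "before" comparison.

```java
// Toy model only: block offsets are plain longs, not real bzip2 block markers.
final class SplitStartSketch {
    // Hypothetical constant mirroring the "10 bytes before the split start" search offset.
    static final long SEARCH_BACKOFF = 10;

    /** Returns the first block start at or after {@code from}, or -1 if there is none. */
    static long nextBlockAtOrAfter(long[] sortedBlockStarts, long from) {
        for (long b : sortedBlockStarts) {
            if (b >= from) {
                return b;
            }
        }
        return -1;
    }

    /**
     * Returns the block start at which a reader for the split beginning at
     * {@code splitStart} would begin reading, or -1 if no block is found.
     * With skipWhenEqual == true, a block that starts exactly at the split
     * start is treated as belonging to the previous split and is skipped.
     */
    static long firstBlockFor(long[] sortedBlockStarts, long splitStart, boolean skipWhenEqual) {
        long searchFrom = Math.max(0, splitStart - SEARCH_BACKOFF);
        long pos = nextBlockAtOrAfter(sortedBlockStarts, searchFrom);
        if (pos < 0) {
            return -1;
        }
        boolean assumedReadByPrevious = skipWhenEqual ? pos <= splitStart : pos < splitStart;
        return assumedReadByPrevious ? nextBlockAtOrAfter(sortedBlockStarts, pos + 1) : pos;
    }

    public static void main(String[] args) {
        long[] blocks = {0, 100, 200};
        // Split starts exactly on a block boundary.
        System.out.println(firstBlockFor(blocks, 100, true));  // 200: the block at 100 is skipped
        System.out.println(firstBlockFor(blocks, 100, false)); // 100: the block at 100 is read
    }
}
```

In this model the two comparisons differ only when a block starts exactly at the split start; for a split start inside a block (for example 105), both variants move on to the block at 200.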

Attachments

HADOOP-11445.001.patch
23/Dec/14 18:37
3 kB
Ankit Kamboj

Issue Links

breaks: HADOOP-13270 BZip2CompressionInputStream finds the same compression marker twice in corner case, causing duplicate data blocks (Closed)

is related to: MAPREDUCE-5948 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well (Closed)

People

Assignee: Ankit Kamboj

Reporter: Ankit Kamboj

Votes: 0

Watchers: 6

Dates

Created: 23/Dec/14 18:33

Updated: 14/Jun/16 01:01

Resolved: 06/Jan/15 21:21