Description
Fix a data correctness issue in TextInputFormat that can occur when reading BZip2-compressed text files. When a file split's range does not include the start position of any BZip2 block, the split is expected to contain no records (i.e. it is empty). However, if the exclusive end of such a split falls exactly at the start of a BZip2 block, LineRecordReader returns all the records in that block. The job then reads those records twice, because the next split, whose range includes the start of that block, returns them as well.
This bug is not triggered when the file split's range includes the start of at least one block and ends just before the start of another block. The reason lies in when BZip2CompressionInputStream updates its position in the BYBLOCK read mode. In this mode, the stream's position is updated only when the first byte past an end-of-block marker is read. The bug is that if the stream was adjusted at initialization to sit at the end of one block, the position is not updated after the first byte of the next block is read. Instead, the position stays equal to the block marker the stream was initialized to. If the exclusive end position of the split equals the stream's position, LineRecordReader continues to read lines until the position is updated (and an additional record in the next block is read if needed).
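The effect described above can be illustrated with a small toy model. This is only a sketch and not Hadoop code: the class SplitOwnershipSketch, the method blocksRead, and the block offsets are all hypothetical, and the model tracks only which blocks a split returns records for, not the stream position logic itself. It assumes a split should return a block's records exactly when the block's start offset lies in [start, end), and models the bug as an otherwise empty split also returning the block that starts at its exclusive end.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, not Hadoop code) of which BZip2 blocks a split returns.
class SplitOwnershipSketch {

    // Returns the start offsets of the blocks whose records the split [start, end) yields.
    // With buggy == true, a split that contains no block start also yields the block
    // that begins exactly at its exclusive end, as described in this issue.
    static List<Long> blocksRead(long[] blockStarts, long start, long end, boolean buggy) {
        List<Long> read = new ArrayList<>();
        for (long b : blockStarts) {
            if (b >= start && b < end) {
                read.add(b); // expected rule: block start falls inside the split
            }
        }
        if (buggy && read.isEmpty()) {
            for (long b : blockStarts) {
                if (b == end) {
                    read.add(b); // buggy extra read of the block at the split's end
                }
            }
        }
        return read;
    }

    public static void main(String[] args) {
        long[] blockStarts = {0L, 100L, 200L};            // assumed block offsets
        long[][] splits = {{0, 50}, {50, 100}, {100, 250}}; // assumed split ranges
        for (boolean buggy : new boolean[] {false, true}) {
            System.out.println(buggy ? "buggy:" : "expected:");
            for (long[] s : splits) {
                System.out.println("  split [" + s[0] + ", " + s[1] + ") reads blocks "
                        + blocksRead(blockStarts, s[0], s[1], buggy));
            }
        }
    }
}
```

In the buggy run, the split [50, 100) contains no block start yet returns the block at offset 100, and the following split [100, 250) returns that block again. A split such as [0, 100), which contains a block start of its own, is unaffected in this model, matching the case the description says does not trigger the bug.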
Issue Links
- is related to
  - HADOOP-18321 Fix when to read an additional record from a BZip2 text file split (Resolved)