Details
Description
Fix data correctness issue with TextInputFormat that can occur when reading BZip2 compressed text files. When triggered this bug would cause a split to return the first record of the succeeding split that reads the next BZip2 block, thereby duplicating that record.
When the bug is triggered?
The condition for the bug to occur requires that the flag "needAdditionalRecord" in CompressedSplitLineReader to be set to true by #fillBuffer at an inappropriate time: when we haven't read the remaining bytes of split. This can happen when the inDelimiter parameter is true while #fillBuffer is invoked while reading the next line. The inDelimiter parameter is true when either 1) the last byte of the buffer is a CR character ('\r') if using the default delimiters, or 2) the last bytes of the buffer are a common prefix of the delimiter if using a custom delimiter.
This can occur in various edge cases, illustrated by five unit tests added in this change – specifically the five that would fail without the fix are as listed below:
- BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart
- BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR
- BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks
- BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize
- BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter
For background, the purpose of "needAdditionalRecord" field in CompressedSplitLineReader is to indicate to LineRecordReader via the #needAdditionalRecordAfterSplit method that an extra record lying beyond the split range should be included in the split. This complication arises due to a problem when splitting text files. When a split starts at a position greater than zero, we do not know whether the first line belongs to the last record in the prior split or is a new record. The solution done in Hadoop is to make splits that start at position greater than zero to always discard the first line and then have the prior split decide whether it should include the first line of the next split or not (as part of the last record or as a new record). This works well even in the case of a single line spanning multiple splits.
What is the fix?
The fix is to prevent ever setting "needAdditionalRecord" if the bytes filled to the buffer are not the bytes immediately outside the range of the split.
When reading compressed data, CompressedSplitLineReader requires/assumes that the stream's #read method never returns bytes from more than one compression block at a time. This ensures that #fillBuffer gets invoked to read the first byte of the next block. This next block may or may not be part of the split we are reading. If we detect that the last bytes of the prior block maybe part of a delimiter, then we may decide that we should read an additional record, but we should only do that when this next block is not part of our split and we aren't filling the buffer again beyond our split range. This is because we are only concerned whether the we need to read the very first line of the next split as a separate record. If it going to be part of the last record, then we don't need to read an extra record, or in the special case of CR + LF (i.e. "\r\n"), if the LF is the first byte of the next split, it will be treated as an empty line, thus we don't need to include an extra record into the mix.
Thus, to emphasize. It is when we read the first bytes outside our split range that matters. But the current logic doesn't take that into account in CompressedSplitLineReader. This is in contrast to UncompressedSplitLineReader which does.
Attachments
Issue Links
- causes
-
HADOOP-18390 Fix out of sync import for HADOOP-18321
- Resolved
- relates to
-
HADOOP-18400 Fix file split duplicating records from a succeeding split when reading BZip2 text files
- Resolved
- links to