[MAPREDUCE-6481] LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.7.0
Fix Version/s: 2.8.0, 2.7.2, 2.6.3, 3.0.0-alpha1
Component/s: mrv2
Labels: None
Hadoop Flags: Reviewed

Description

LineRecordReader may sometimes return an incomplete record and wrong position/key information for uncompressed input.
There are two issues:

1. LineRecordReader may return an incomplete record: some characters are cut off at the end of the record.
2. LineRecordReader may return wrong position/key information.

The first issue happens only with a custom delimiter and is caused by the following code in LineReader#readCustomLine:

    if (appendLength > 0) {
      if (ambiguousByteCount > 0) {
        str.append(recordDelimiterBytes, 0, ambiguousByteCount);
        //appending the ambiguous characters (refer case 2.2)
        bytesConsumed += ambiguousByteCount;
        ambiguousByteCount=0;
      }
      str.append(buffer, startPosn, appendLength);
      txtLength += appendLength;
    }

The bug is triggered when appendLength is 0 and ambiguousByteCount is not 0: the ambiguous bytes are then never appended to the record. For example, with input "123456789aab", custom delimiter "ab", bufferSize 10, and splitLength 12, the correct record is "123456789a" (length 10), but the current code returns the incomplete record "123456789" (length 9).
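The effect of the nesting can be reproduced with a small standalone model of the append step. This is only a sketch: the class AmbiguousFlushSketch, its methods, and the hand-fed values (appendLength 9 with one ambiguous byte after the first buffer fill, then appendLength 0 after the second) are assumptions made for illustration, not Hadoop code. The bookkeeping of bytesConsumed and txtLength is omitted, and the hoisted variant is a hypothetical alternative, not necessarily the change in the attached patch.

```java
import java.nio.charset.StandardCharsets;

// Simplified model of the append step only; this is not Hadoop's LineReader.
final class AmbiguousFlushSketch {
    final StringBuilder str = new StringBuilder();
    final byte[] recordDelimiterBytes;
    int ambiguousByteCount = 0;

    AmbiguousFlushSketch(String delimiter) {
        this.recordDelimiterBytes = delimiter.getBytes(StandardCharsets.US_ASCII);
    }

    // Appends the held-back delimiter prefix to the record and clears the counter.
    private void flushAmbiguous() {
        str.append(new String(recordDelimiterBytes, 0, ambiguousByteCount,
                StandardCharsets.US_ASCII));
        ambiguousByteCount = 0;
    }

    private void appendPayload(byte[] buffer, int startPosn, int appendLength) {
        str.append(new String(buffer, startPosn, appendLength, StandardCharsets.US_ASCII));
    }

    // Same nesting as the quoted block: the flush is reachable only when appendLength > 0.
    void appendAsQuoted(byte[] buffer, int startPosn, int appendLength) {
        if (appendLength > 0) {
            if (ambiguousByteCount > 0) {
                flushAmbiguous();
            }
            appendPayload(buffer, startPosn, appendLength);
        }
    }

    // Hypothetical variant: the flush no longer depends on there being payload to append.
    void appendWithHoistedFlush(byte[] buffer, int startPosn, int appendLength) {
        if (appendLength >= 0 && ambiguousByteCount > 0) {
            flushAmbiguous();
        }
        if (appendLength > 0) {
            appendPayload(buffer, startPosn, appendLength);
        }
    }

    // Replays the "123456789aab" example with hand-fed values (assumed, see lead-in).
    static String replayExample(boolean hoisted) {
        AmbiguousFlushSketch s = new AmbiguousFlushSketch("ab");
        byte[] firstFill = "123456789a".getBytes(StandardCharsets.US_ASCII);
        byte[] secondFill = "ab".getBytes(StandardCharsets.US_ASCII);

        // Pass 1: nine payload bytes are appended; the trailing 'a' might start a delimiter.
        if (hoisted) s.appendWithHoistedFlush(firstFill, 0, 9);
        else s.appendAsQuoted(firstFill, 0, 9);
        s.ambiguousByteCount = 1;

        // Pass 2: the new buffer holds exactly the delimiter, so there is no payload.
        if (hoisted) s.appendWithHoistedFlush(secondFill, 0, 0);
        else s.appendAsQuoted(secondFill, 0, 0);
        return s.str.toString();
    }

    public static void main(String[] args) {
        System.out.println(replayExample(false)); // prints 123456789
        System.out.println(replayExample(true));  // prints 123456789a
    }
}
```

In this model the quoted nesting drops the held-back 'a' because the second pass has nothing else to append, while the hoisted variant returns the full 10-character record.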

The second issue can happen with both custom and default delimiters and is caused by the code in UncompressedSplitLineReader#readLine, which may report wrong size information in some corner cases. The cause is unusedBytes in the following code:

    bytesRead += unusedBytes;
    unusedBytes = bufferSize - getBufferPosn();
    bytesRead -= unusedBytes;

If the last read returned fewer bytes (bufferLength) than bufferSize, the previous unusedBytes is wrong: it should be bufferLength - bufferPosn instead of bufferSize - bufferPosn. As a result, readLine returns a larger value than it should.
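The size of the error can be checked with plain arithmetic. The sketch below is illustrative only: UnusedBytesSketch and its method names are hypothetical, and the values bufferSize = 10 and bufferLength = 5 are assumed sample numbers for a short read.

```java
// Arithmetic-only model of the two ways to compute unusedBytes; not Hadoop code.
final class UnusedBytesSketch {
    // As in the quoted code: assumes the buffer was filled completely.
    static int unusedAsQuoted(int bufferSize, int bufferPosn) {
        return bufferSize - bufferPosn;
    }

    // As the description suggests: uses the number of bytes actually read.
    static int unusedFromActualFill(int bufferLength, int bufferPosn) {
        return bufferLength - bufferPosn;
    }

    // The overestimate does not depend on bufferPosn.
    static int overestimate(int bufferSize, int bufferLength, int bufferPosn) {
        return unusedAsQuoted(bufferSize, bufferPosn)
                - unusedFromActualFill(bufferLength, bufferPosn);
    }

    public static void main(String[] args) {
        int bufferSize = 10;   // assumed buffer capacity
        int bufferLength = 5;  // assumed short read
        for (int bufferPosn = 0; bufferPosn <= bufferLength; bufferPosn++) {
            // Prints 5 for every position.
            System.out.println(overestimate(bufferSize, bufferLength, bufferPosn));
        }
    }
}
```

Whatever the buffer position, the quoted expression overstates the unused bytes by bufferSize - bufferLength, which is the amount by which the reported size drifts after a short read.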
For example, with input "1234567890ab12ab345", custom delimiter "ab", bufferSize 10, and two splits (the first with splitLength 15, the second with splitLength 4), the current code gives the following result:
First record: Key:0 Value:"1234567890"
Second record: Key:12 Value:"12"
Third record: Key:21 Value:"345"
The key for the third record is wrong: it should be 16 instead of 21. This is due to the wrong unusedBytes. fillBuffer read 10 bytes the first time but only 5 bytes the second time, which is 5 bytes fewer than bufferSize. That is why the key is 5 bytes larger than the correct one.
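The expected keys can be cross-checked independently of LineRecordReader by scanning the whole input for the delimiter and recording where each record starts. This is a minimal sketch that ignores splits and buffering entirely; RecordOffsetsSketch and recordStartOffsets are hypothetical names, and the input is assumed to be single-byte ASCII so that character offsets equal byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Reference computation of record start offsets; not Hadoop code.
final class RecordOffsetsSketch {
    // Returns the offset at which each delimiter-separated record begins.
    static List<Integer> recordStartOffsets(String input, String delimiter) {
        List<Integer> offsets = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            offsets.add(start);
            int hit = input.indexOf(delimiter, start);
            if (hit < 0) {
                break; // the last record has no trailing delimiter
            }
            start = hit + delimiter.length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Prints [0, 12, 16]
        System.out.println(recordStartOffsets("1234567890ab12ab345", "ab"));
    }
}
```

The third offset is 16, and 16 + 5 = 21 matches the wrong key reported above, consistent with a 5-byte overcount.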

Attachments

MAPREDUCE-6481.000.patch, 17/Sep/15 06:05, 19 kB, Zhihai Xu

Issue Links

is related to: MAPREDUCE-6549 multibyte delimiters with LineRecordReader cause duplicate records (Closed)
relates to: MAPREDUCE-5948 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well (Closed)

People

Assignee: Zhihai Xu
Reporter: Zhihai Xu
Votes: 0
Watchers: 9

Dates

Created: 17/Sep/15 05:52
Updated: 06/Jan/17 00:55
Resolved: 17/Sep/15 14:33