The issue is related to MAPREDUCE-6481. That JIRA changed the position calculation and made sure that full records are returned by the reader as expected. It did not anticipate the record duplication, and the JUnit tests did not cover the use cases well enough to expose the issue.
As far as I can trace, the problem is limited to multi-byte delimiters.
The JUnit tests for multi-byte delimiters only take the best-case scenario into account: the input data contained the exact delimiter and no ambiguous characters. As soon as the test is changed, either in the delimiter or in the input data, a failure is triggered. The difficulty is that the failure does not clearly show when and how it occurs. Analysis of the test failures shows that it takes a specific combination of input data, split size and buffer size to trigger one.
Based on testing, the duplication of the record occurs only if all of the following hold:
- the first character(s) of the delimiter are part of the record data, for example:
1) the delimiter is += and the data contains a + that is not followed by =
2) the delimiter is +=+= and the data contains +=+ that is not followed by =
- the delimiter character is found at the split boundary, i.e. it is the last character before the split ends
- a fill of the buffer is triggered to finish processing the record
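The first two conditions can be checked mechanically when constructing test input. The following is a minimal sketch, not code from Hadoop or from the patch; the class and method names are hypothetical, and it uses a simple left-to-right scan that may differ from how the real reader matches delimiters. It reports whether a partial delimiter match that is never completed covers the last byte before a given split end.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building test input; not part of Hadoop.
class AmbiguousSplitCheck {

    // Returns true if a partial (never completed) delimiter match covers the
    // last byte before splitEnd, scanning the data left to right.
    static boolean partialDelimiterAtSplitEnd(String input, String delimiter, int splitEnd) {
        byte[] data = input.getBytes(StandardCharsets.UTF_8);
        byte[] delim = delimiter.getBytes(StandardCharsets.UTF_8);
        if (delim.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        int i = 0;
        while (i < splitEnd && i < data.length) {
            // Count how many delimiter bytes match starting at position i.
            int m = 0;
            while (m < delim.length && i + m < data.length && data[i + m] == delim[m]) {
                m++;
            }
            if (m == delim.length) {
                i += m;        // complete delimiter: skip over it
            } else if (m > 0 && i + m >= splitEnd) {
                return true;   // partial match reaches the split boundary
            } else {
                i++;           // plain record data
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(partialDelimiterAtSplitEnd("a+b", "+=", 2));     // true
        System.out.println(partialDelimiterAtSplitEnd("a+=b", "+=", 2));    // false
        System.out.println(partialDelimiterAtSplitEnd("x+=+y", "+=+=", 4)); // true
    }
}
```

Inputs for which this returns true are candidates for the duplication; whether it actually happens still depends on the buffer size forcing a fill at that point.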
The underlying problem is that we set a flag called needAdditionalRecord in the UncompressedSplitLineReader when we fill the buffer and have encountered part of a delimiter in combination with a split. We keep track of the partial match in the ambiguous character number. However, it turns out that if the character(s) found after that point do not belong to a delimiter, we do not unset needAdditionalRecord. This causes the next record to be read twice, and thus we see a duplication of records.
The solution would be to unset the flag when we detect that we're not processing a delimiter. We currently only add the ambiguous characters to the record read and set the number back to 0. At the same point we need to unset the flag.
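To make the intended change concrete, here is a simplified model of the bookkeeping. It is a sketch under assumptions, not the actual reader code and not the patch: only the flag name needAdditionalRecord and the idea of an ambiguous character number come from the description above, while the class, the methods and the way they are called are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Simplified, hypothetical model of the state handling; NOT the Hadoop reader.
class AmbiguousStateSketch {
    private final byte[] delimiter;
    private final StringBuilder record = new StringBuilder();
    private int ambiguousByteCount = 0;
    private boolean needAdditionalRecord = false;

    AmbiguousStateSketch(String delimiter) {
        this.delimiter = delimiter.getBytes(StandardCharsets.UTF_8);
    }

    // The buffer ran out at the split end after matchedBytes bytes of the
    // delimiter had been seen: remember them and request one more record.
    void bufferExhaustedAtSplitEnd(int matchedBytes) {
        if (matchedBytes < 0 || matchedBytes >= delimiter.length) {
            throw new IllegalArgumentException("expected a partial delimiter match");
        }
        ambiguousByteCount = matchedBytes;
        needAdditionalRecord = matchedBytes > 0;
    }

    // First byte after the refill: decide whether the delimiter continues.
    void firstByteAfterRefill(byte b) {
        if (ambiguousByteCount > 0 && b != delimiter[ambiguousByteCount]) {
            // Not a delimiter after all: the held-back bytes are record data.
            record.append(new String(delimiter, 0, ambiguousByteCount, StandardCharsets.UTF_8));
            ambiguousByteCount = 0;
            // The proposed step: clear the flag at the same point, otherwise
            // the next record would be read a second time.
            needAdditionalRecord = false;
        }
        // Normal processing of b would continue here in a real reader.
    }

    boolean needAdditionalRecord() {
        return needAdditionalRecord;
    }

    String recordSoFar() {
        return record.toString();
    }
}
```

With the delimiter += and one ambiguous + at the split end, a following b clears the flag and moves the + into the record, while a following = leaves the flag set because the delimiter really did span the boundary.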
The patch was developed based on JUnit tests that exercise the split and buffer settings in combination with multiple delimiter types using different inputs. All cases now provide a consistent count of records and the correct position inside the data.
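The shape of such a sweep can be sketched as follows. This is an illustration only, not the tests shipped with the patch: RecordCounter is a hypothetical stand-in for whatever drives the real reader over all splits, and the reference count is a plain left-to-right scan of the input.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sweep harness; the reader under test is supplied by the caller.
class SplitBufferSweepSketch {

    // Stand-in for the reader under test: total number of records read across
    // all splits for the given split and buffer sizes.
    interface RecordCounter {
        int count(byte[] data, byte[] delimiter, int splitSize, int bufferSize);
    }

    // Reference count: delimiter-separated records, scanned left to right.
    static int expectedRecords(byte[] data, byte[] delim) {
        int records = 0;
        int start = 0;
        int i = 0;
        while (i + delim.length <= data.length) {
            boolean match = true;
            for (int j = 0; j < delim.length && match; j++) {
                match = data[i + j] == delim[j];
            }
            if (match) {
                records++;
                i += delim.length;
                start = i;
            } else {
                i++;
            }
        }
        if (start < data.length) {
            records++; // trailing record without a closing delimiter
        }
        return records;
    }

    // Tries every split size and buffer size from 1 to maxSize and returns a
    // description of each combination whose record count is wrong.
    static List<String> sweep(RecordCounter reader, String input, String delimiter, int maxSize) {
        byte[] data = input.getBytes(StandardCharsets.UTF_8);
        byte[] delim = delimiter.getBytes(StandardCharsets.UTF_8);
        int expected = expectedRecords(data, delim);
        List<String> failures = new ArrayList<>();
        for (int split = 1; split <= maxSize; split++) {
            for (int buffer = 1; buffer <= maxSize; buffer++) {
                int actual = reader.count(data, delim, split, buffer);
                if (actual != expected) {
                    failures.add("split=" + split + " buffer=" + buffer
                        + " expected=" + expected + " actual=" + actual);
                }
            }
        }
        return failures;
    }
}
```

Reporting the failing split and buffer sizes explicitly addresses the point that a bare test failure does not show when and how the reader goes wrong.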