Affects Version/s: 2.2.0
Fix Version/s: None
When a custom record delimiter is used with TextInputFormat, the resulting blocks are incorrect under some circumstances: the total number of records is wrong and some entries are duplicated.
I have created a reproducible test case:
Generate a File
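The generator used for this report is not shown here. The following is only a minimal sketch of how such a file could be produced, assuming the records are the consecutive integers starting at 1 (which matches the log output below). The class name, the method name, and the delimiter "|||" are hypothetical and not taken from the original test.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class DelimitedFileGenerator {
    // Hypothetical delimiter: the one used in the original report is unknown.
    static final String DELIMITER = "|||";

    /** Writes the values 1..count to target, each followed by the delimiter. */
    static void generate(Path target, int count, String delimiter) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (int i = 1; i <= count; i++) {
                out.write(Integer.toString(i));
                out.write(delimiter);
            }
        }
    }
}
```

Calling generate with a count of 10000000 would produce the number of records that the failing assertion expects.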
Java-Test to reproduce the error
This example fails with the error
java.lang.AssertionError: expected:<10000000> but was:<10042616>
When the assertion on the size of the collection is commented out, my log output ends like this:
[main] INFO edu.udo.cs.schaefer.testspark.Main - Correct value for index 663244: expected 663245 -> got 663245
[main] ERROR edu.udo.cs.schaefer.testspark.Main - Wrong value for index 663245: expected 663246 -> got 660111
After the wrong value at index 663245, the values are in order again and continue with 660112, 660113, ...
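The pattern in the log (a sequence that jumps back and then continues in order) can be checked independently of Spark. This is a minimal sketch, assuming the records have already been collected into a list of strings and that index i should hold the value i + 1, as in the log lines above. The class and method names are hypothetical and are not part of the original test.

```java
import java.util.List;

class RecordOrderChecker {
    /**
     * Returns the index of the first record whose value differs from index + 1,
     * or -1 if every record holds the expected value.
     */
    static int firstMismatch(List<String> records) {
        for (int i = 0; i < records.size(); i++) {
            long expected = i + 1L;
            long actual = Long.parseLong(records.get(i).trim());
            if (actual != expected) {
                return i;
            }
        }
        return -1;
    }
}
```

For a list such as 1, 2, 3, 2, 3, 4, where a run of records is repeated, the method returns the index at which the repeated run starts.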
The error is not reproducible with \n as delimiter, i.e. when not using a custom delimiter.