Actually I agree with Rushabh that there are at least two somewhat different problems here. The original problem reported in the JIRA is that records can be dropped with uncompressed inputs, and we should fix that so we don't lose data. I'm assuming Rushabh's patch solves that issue, but I haven't looked at it in detail just yet.
There's another issue related to mistaken record delimiter recognition, where a subsequent split reader can think it found a delimiter when the real record delimiter is somewhere else. Suppose the record delimiter is 'xxx' and the subsequent split reader sees 'xxxxyzxxx' at the beginning of its split. It will toss out the first record (i.e.: up through the first 'xxx') then read 'xyz' as the next record. However that may or may not be the correct behavior, because with that kind of delimiter and data the correct behavior depends upon the previous split's data. If the previous split ended with 'abc' then the behavior is correct, and there are two records in the stream: 'abc' and 'xyz'. If the previous split ended with 'abcx' then the behavior is incorrect: the records should be 'abc' and 'xxyz', but the second split reader will report an 'xyz' record that shouldn't exist.
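To make the ambiguity concrete, here's a small Python sketch of the scenario (this is not the Hadoop reader code, just an illustration using a 'xxx' delimiter and the sample data from above):

```python
# Delimiter and second-split bytes from the example above.
DELIM = "xxx"
second_split = "xxxxyzxxx"

def true_records(stream):
    """Parse the whole stream left to right, splitting on the delimiter."""
    return stream.split(DELIM)

def naive_second_split(split_data):
    """Mimic a reader that skips through the first delimiter it sees in
    its split, then reports the next record."""
    first = split_data.find(DELIM)
    rest = split_data[first + len(DELIM):]
    return rest.split(DELIM)[0]

# Case 1: previous split ended with 'abc' -> full stream 'abcxxxxyzxxx'
print(true_records("abc" + second_split))   # ['abc', 'xyz', '']
# Case 2: previous split ended with 'abcx' -> full stream 'abcxxxxxyzxxx'
print(true_records("abcx" + second_split))  # ['abc', 'xxyz', '']

# The naive reader sees only its own bytes and reports 'xyz' either way,
# which is wrong in case 2:
print(naive_second_split(second_split))     # 'xyz'
```

The point of the sketch is that the second reader's bytes are identical in both cases, so no amount of local cleverness can disambiguate them.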
To solve that problem, either a split reader would have to examine the prior split's data to distinguish these cases, or it would have to realize the situation is ambiguous and leave the record processing to the previous split's reader. The former can be very expensive if the prior split is compressed, since it potentially has to unpack that entire split. It can also get very tricky, because a reader may need to read more than one other split to resolve the ambiguity. For example, if the data stream is 'axxxxxxxxxxxxx......xxxxxxbxxxxxx......xxxxxcxxxxxx' then a reader may have to scan far down into subsequent splits, since only the reader that starts at the beginning of the stream knows where the true record boundaries are. Simply tacking an extra character onto the beginning of that input changes where the record boundaries are and the record contents, even in the last split of the input. Solving this requires a different high-level algorithm for split processing than what we have today (i.e.: throw away the first record and go), so I believe that's something better left to a followup JIRA.
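The long-range dependence can also be sketched in Python. With a 'xxx' delimiter, how a long run of 'x' bytes parses depends on everything before it, so a reader dropped into the middle of such a run can't locate record boundaries without consulting prior splits, possibly all the way back to the last non-'x' byte (the run lengths here are made-up values for illustration):

```python
DELIM = "xxx"
tail = "b" + "x" * 6 + "c"   # identical bytes, as a later split reader sees them

# Vary only the length of the x-run that precedes the tail.
for n in (12, 13, 14):
    stream = "a" + "x" * n + tail
    print(n, stream.split(DELIM))

# n=12 -> ['a', '', '', '', 'b',   '', 'c']
# n=13 -> ['a', '', '', '', 'xb',  '', 'c']
# n=14 -> ['a', '', '', '', 'xxb', '', 'c']
```

The record containing 'b' is 'b', 'xb', or 'xxb' depending only on the run length modulo the delimiter length, i.e. on data arbitrarily far upstream of the bytes the later reader actually sees.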
It'd be nice to solve the dropped-record problem for scenarios where we don't have to worry about mistaken record delimiter recognition in the data, as that's an incremental improvement from where we are today. I'll try to get some time to review the latest patch and provide comments soon.