I attached a unit test for this issue. You can reproduce the problem by running the unit test.
First, this issue happens when the position of a newly created stream is equal to the start of the split. Hadoop has test cases for this situation (e.g. the blockEndingInCR.txt.bz2 file used by TestLineRecordReader#testBzip2SplitStartAtBlockMarker). However, those tests do not hit the issue I am reporting, because it only occurs when the byte block at the start of the split contains both a block marker and compressed data.
BZip2 block marker - 0x314159265359 (001100010100000101011001001001100101001101011001)
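For reference, here is a minimal sketch of scanning a byte buffer for this 48-bit marker. This is a hypothetical helper, not Hadoop code, and it only checks byte-aligned positions; in a real bz2 stream the marker can start at any bit offset.

```java
public class Bzip2MarkerScan {
    // BZip2 compressed-block magic: 0x314159265359 (48 bits)
    private static final long BLOCK_MARKER = 0x314159265359L;

    // Returns the byte offset of the first byte-aligned block marker
    // at or after 'from', or -1 if none is found.
    static int findMarker(byte[] data, int from) {
        for (int i = from; i + 6 <= data.length; i++) {
            long v = 0;
            for (int j = 0; j < 6; j++) {
                v = (v << 8) | (data[i + j] & 0xFF);
            }
            if (v == BLOCK_MARKER) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // One byte of compressed data, then a marker: a split starting
        // at offset 0 covers both, which is the situation this report
        // is about.
        byte[] data = new byte[] {
            0x00, 0x31, 0x41, 0x59, 0x26, 0x53, 0x59, 0x00
        };
        System.out.println(findMarker(data, 0)); // marker begins at offset 1
    }
}
```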
blockEndingInCR.txt.bz2 (Start of Split - 136504):
Test bz2 File (Start of Split - 203426)
Let's say a job splits this test bz2 file into two splits at position 203426.
The first split does not read the records starting at position 203426, because BZip2 reports their position as 203427, which is past the end of the split. The second split does not read them either, because BZip2CompressionInputStream starts reading from the next block at position 320955.
As a result, the records between positions 203427 and 320955 are lost.
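The gap can be shown with a simplified, hypothetical version of the split-ownership checks, using the positions from this report (the real logic lives in LineRecordReader and the codec stream, and is more involved):

```java
public class SplitLossDemo {
    // Positions taken from this report.
    static final long SPLIT_BOUNDARY = 203426; // end of split 1 / start of split 2
    static final long REPORTED_POS = 203427;   // position BZip2 reports for the records
    static final long NEXT_MARKER = 320955;    // where the second split's stream resyncs

    // Simplified: split 1 keeps a record only if its reported
    // position does not pass the split boundary.
    static boolean firstSplitReads(long recordPos) {
        return recordPos <= SPLIT_BOUNDARY;
    }

    // Simplified: split 2 only sees data from the block its
    // stream resyncs to.
    static boolean secondSplitReads(long recordPos) {
        return recordPos >= NEXT_MARKER;
    }

    public static void main(String[] args) {
        // Neither split claims the records at the reported position,
        // so everything up to the next block marker is dropped.
        System.out.println(firstSplitReads(REPORTED_POS));  // false
        System.out.println(secondSplitReads(REPORTED_POS)); // false
    }
}
```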