Nigel, sorry, I should have included the Hadoop unit test results in my comments. We did test the patch extensively internally, and I also ran the Hadoop unit tests myself.
The reason for not adding a separate test is exactly what Rodrigo said. The data that corrupts the current implementation is about 1MB in size, but we cannot disclose it. There is another public data set that breaks the old code, but it is about 20MB in size, and I don't think we want to include that much data in the Hadoop codebase.
Also, as you can see, the patch changes just 2 bytes inside the BZip2 algorithm itself, literally.
I will definitely be more careful next time.