Description
I came across another out-of-spec issue which causes PDFBox to crash. Here's the object:
5 0 obj
<</Type /Page
/Parent 6 0 R
/MediaBox [ 0 0 610.560 783.360
endstream
endobj
There are several problems here. The MediaBox array has no closing right square bracket, there is no ">>" to end the dictionary, and there is an "endstream" stuck in there for no apparent reason. I actually found this in the wild, but I do not know whether it comes from a bug in the creating program, data corruption, or something else. I do know that Adobe Reader parses it without crashing. Since this is not a conforming PDF, the result is undefined, so crashing (which is what PDFBox will eventually do when it tries to process the next object in the file) is a perfectly acceptable thing to do.
However, I'd like PDFBox to detect that the array has ended when it sees endstream, ignore the rogue endstream, and then recognize that the object has ended when it sees "endobj". I'm going to go one step further and also accept the same object even if endstream or endobj is missing. In addition to the object above, I also tested it with these objects:
% endobj, without the endstream
5 0 obj
<</Type /Page
/Parent 6 0 R
/MediaBox [ 0 0 610.560 783.360
endobj
% endstream, without the endobj
5 0 obj
<</Type /Page
/Parent 6 0 R
/MediaBox [ 0 0 610.560 783.360
endstream
% properly ended array, dictionary and object (aka conforming PDF)
5 0 obj
<</Type /Page
/Parent 6 0 R
/MediaBox [ 0 0 610.560 783.360 ]
>>
endobj
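The recovery rule described above can be sketched roughly as follows. This is a simplified, standalone illustration and not the actual BaseParser patch: the class LenientArraySketch and its methods readArray, skipObjectEnd and mediaBox are hypothetical names, and the sketch assumes tokens are separated by whitespace, which a real PDF does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the lenient parsing rule; not PDFBox code.
final class LenientArraySketch {

    // Keywords that can never be array elements, so seeing one means ']' is missing.
    static boolean isObjectTerminator(String token) {
        return token.equals("endstream") || token.equals("endobj");
    }

    // Reads array elements starting just after '['.
    // Stops at ']' (consumed), or at endstream/endobj/end of input (not consumed).
    // Returns the index of the first token the caller still has to handle.
    static int readArray(String[] tokens, int start, List<String> elements) {
        int i = start;
        while (i < tokens.length) {
            String token = tokens[i];
            if (token.equals("]")) {
                return i + 1;
            }
            if (isObjectTerminator(token)) {
                return i;
            }
            elements.add(token);
            i++;
        }
        return i;
    }

    // Skips an optional '>>', an optional rogue endstream and an optional endobj,
    // so the object is accepted whether or not each of them is present.
    static int skipObjectEnd(String[] tokens, int pos) {
        if (pos < tokens.length && tokens[pos].equals(">>")) {
            pos++;
        }
        if (pos < tokens.length && tokens[pos].equals("endstream")) {
            pos++;
        }
        if (pos < tokens.length && tokens[pos].equals("endobj")) {
            pos++;
        }
        return pos;
    }

    // Convenience helper: returns the elements of the first array in the object body.
    static List<String> mediaBox(String objectBody) {
        String[] tokens = objectBody.trim().split("\\s+");
        List<String> elements = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals("[")) {
                readArray(tokens, i + 1, elements);
                break;
            }
        }
        return elements;
    }
}
```

Under these assumptions, all four test objects listed above yield the same four MediaBox values, because the terminating keyword is left unconsumed by readArray and then skipped by skipObjectEnd.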
Although this change will only affect PDFs that do not conform to the spec, I want to put the patch up for review before committing it to SVN, since it modifies BaseParser.java. If I do not hear any objections or concerns in the next few days, I'll go ahead and commit it.