Proposed addition to 1.8.2 -> pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java -> parseHeader(): default the PDF version to 1.4 when the version is missing from the header (yes, there really are documents out there like this!).
This prevents an exception caused by a negative substring offset calculation: "String index out of range: -3"
I raised the question on the email@example.com mailing list (10th October 2013), and it was suggested that I default the PDF version to 1.4 in this scenario. I have tested the change locally and it works (apparently PDFBox doesn't take the version number into account anyway).
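To illustrate the idea, here is a minimal standalone sketch of the fallback. It is not the actual PDFBox code and not the attached patch: the class name HeaderVersionSketch, the method parseVersion and the constants are hypothetical names chosen for this example. It only assumes that the header line looks like "%PDF-1.x" when intact, and that the version text may be absent or unparseable in broken files.

```java
public class HeaderVersionSketch {

    // Marker that introduces the version in a well-formed header line.
    static final String PDF_HEADER = "%PDF-";

    // Fallback suggested on the mailing list for headers without a version.
    static final float DEFAULT_VERSION = 1.4f;

    /**
     * Returns the version found after "%PDF-" in the given header line,
     * or DEFAULT_VERSION if the marker or the version text is missing
     * or cannot be parsed.
     */
    static float parseVersion(String headerLine) {
        if (headerLine == null) {
            return DEFAULT_VERSION;
        }
        int markerStart = headerLine.indexOf(PDF_HEADER);
        if (markerStart < 0) {
            return DEFAULT_VERSION;
        }

        // Everything after the marker; may be empty for a bare "%PDF-".
        String rest = headerLine.substring(markerStart + PDF_HEADER.length()).trim();

        // Keep only the leading run of digits and dots, so that trailing
        // garbage never takes part in any offset calculation.
        int end = 0;
        while (end < rest.length()
                && (Character.isDigit(rest.charAt(end)) || rest.charAt(end) == '.')) {
            end++;
        }
        String versionText = rest.substring(0, end);

        if (versionText.isEmpty()) {
            return DEFAULT_VERSION;
        }
        try {
            return Float.parseFloat(versionText);
        } catch (NumberFormatException e) {
            // For example "." or "1.2.3": treat as a missing version.
            return DEFAULT_VERSION;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseVersion("%PDF-1.6"));
        System.out.println(parseVersion("%PDF-"));
    }
}
```

The point of the sketch is that every substring bound is derived from the actual length of the line, so a short or versionless header falls through to the default rather than producing a negative offset.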
Now over to you guys to decide if this is a good idea or not in the wider scope.
In case you give the green light, I have attached:
1) a sample file which causes the exception
2) a patch file
3) patching instructions.
My goal is text extraction, even on broken files (if possible).