Let's use this ticket to discuss/prepare for the release and integration of PDFBox 1.8.10 when it is available.
PDFBox can throw StringIndexOutOfBoundsException on some dates
StringIndexOutOfBoundsException when doing DateConverter.parseDate()
Should be able to remove catch blocks around dates once we upgrade to 1.8.10.
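For reference, here is a rough sketch of the kind of defensive catch block meant here. It does not use the actual Tika or PDFBox code: `fragileYear` is a hypothetical stand-in for a date parser that fails with an unchecked exception on truncated input, and `parseQuietly` is an illustrative helper name, not an existing API.

```java
import java.util.Optional;
import java.util.function.Function;

public class SafeDateParse {

    // Hypothetical stand-in for a fragile date parser (NOT PDFBox's DateConverter).
    // Expects input shaped like "D:YYYY..."; substring() throws
    // StringIndexOutOfBoundsException when the string is too short.
    public static String fragileYear(String pdfDate) {
        return pdfDate.substring(2, 6);
    }

    // Defensive wrapper: turns the unchecked exception into an empty result
    // so that one malformed date does not abort the whole parse.
    public static <T> Optional<T> parseQuietly(String raw, Function<String, T> parser) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(parser.apply(raw));
        } catch (StringIndexOutOfBoundsException e) {
            // Treat the malformed date as missing metadata.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseQuietly("D:20150722", SafeDateParse::fragileYear));
        System.out.println(parseQuietly("D:", SafeDateParse::fragileYear));
    }
}
```

If the upstream fix holds, a wrapper like this becomes dead code and the call site can go back to invoking the parser directly.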
Attached is the current version of the reports comparing PDFBox 1.8.9 with PDFBox 1.8.10 on the PDFs in govdocs1.
Overall takeaway: no new exceptions, no fixed exceptions.
Without looking carefully at the files, it looks like there is a slight improvement in 005937.pdf and 722558.pdf. It looks like there might be a very small regression in 167853.pdf, where one instance of "respond" has become "respondæ".
I realize now that I should try this again with the PDFBOX-2823 catch blocks removed...doh!
The weird thing is that I can't find any differences with ExtractText and default settings. "respondæ" appears in both extractions. "æ" is an arrow in the PDF.
Interesting. This must be another case of the multi-threading nondeterminism driven by the static caching of fonts in 1.8.x. This may also explain why there were some apparent differences in the recent NaN comparison I ran.
Sorry to waste your time!
SUCCESS: Integrated in tika-trunk-jdk1.7 #798 (See https://builds.apache.org/job/tika-trunk-jdk1.7/798/)
TIKA-1588 upgrade to PDFBox 1.8.10 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692341)