We'll see what Hudson says, but I just pushed the mods to Tika's 2.x branch as well.
A few notes:
1) XMPBox is currently designed to handle PDF/A, and it threw exceptions on roughly 40% of the XMP packets extracted from our test corpus. We'll stick with jempbox 1.8.x for XMP parsing for now, and we may consider migrating to Adobe's xmpcore. If anyone wants to help make XMPBox more robust, that'd be a huge service. Ref: this email
2) PDFBox 2.0 has gotten rid of the classic parser; all parsing is now done by the non-sequential parser. In my opinion, the PDFBox devs put a tremendous amount of work into making this new parser quite robust. However, for truncated or otherwise truly damaged files, users may have better luck with the classic parser in 1.8.x.
3) PDFBox 2.0 no longer extracts TIFF files. See this exchange, and consider adding the optional dependencies to handle TIFF, JPEG 2000 and ...
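For anyone who wants to check whether the optional image readers from note 3 are actually visible at runtime, here is a minimal sketch. It assumes the optional dependencies register themselves through the standard javax.imageio plugin mechanism; that is my assumption, not something confirmed above. The class name `ImageReaderCheck` and the format names passed in are illustrative only, and the exact names a given plugin registers may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.imageio.ImageIO;

// Hypothetical helper: reports which image formats have an ImageIO reader
// registered on the current classpath.
public class ImageReaderCheck {

    // Returns format name -> true if at least one ImageIO reader claims it.
    static Map<String, Boolean> availableReaders(String... formatNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : formatNames) {
            result.put(name, ImageIO.getImageReadersByFormatName(name).hasNext());
        }
        return result;
    }

    public static void main(String[] args) {
        // Pick up any plugins added to the classpath after ImageIO was initialized.
        ImageIO.scanForPlugins();
        // Format names here are assumptions; plugins may register other spellings.
        availableReaders("tiff", "jpeg2000", "png")
                .forEach((format, present) ->
                        System.out.println(format + " reader available: " + present));
    }
}
```

If a format reports false, the corresponding optional dependency is probably missing from the classpath, though a plugin could also be registered under a different format name.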
Other than those major points, in my opinion, PDFBox 2.0.0 should fix quite a few issues and is far more robust for bidi documents.
Many thanks to the PDFBox devs, especially Andreas Lehmkühler, Maruan Sahyoun and Tilman Hausherr, for their work on PDFBox and for their collaboration on the eval process. More work remains on the latter.