Spinoff from TIKA-717.
Per discussion on tika-dev I'll leave this issue closed, and commit this fix under TIKA-778 instead.
Patch, fixing the excess </p> tag.
Reopening per the discussion on tika-dev; it looks like this fix also caused the NPE in TIKA-778.
I'll open a separate issue to address the TODOs at the next PDFBox upgrade.
Patch, extracting text from annotations. I added an option to PDFParser to turn this on or off, and I re-enabled the test case, which now passes.
I opened PDFBOX-1143 to improve PDFTextStripper so that it visits text annotations.
I also worked out a simple patch to PDF2XHTML so that we extract the annotations directly ourselves until PDFBOX-1143 is fixed.
I moved the failing (but ignored) test case into PDFParserTest.
Browsing through PDFBox's sources, it seems to have a lot of code for handling annotations, so hopefully it's just a matter of Tika tapping into this...