TIKA-738: Tika fails to extract text from PDF annotations

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      Spinoff from TIKA-717.

      Attachments

      1. TIKA-738.patch (4 kB, Michael McCandless)
      2. TIKA-738.patch (8 kB, Michael McCandless)

        Activity

        Michael McCandless added a comment -

        Per discussion on tika-dev I'll leave this issue closed, and commit this fix under TIKA-778 instead.

        Michael McCandless added a comment -

        Patch, fixing the excess </p> tag.

        Michael McCandless added a comment -

        Reopening per the discussion on tika-dev; it looks like this fix also caused the NPE in TIKA-778.

        Michael McCandless added a comment -

        I'll open a separate issue to also address TODOs on next PDFBox upgrade.

        Michael McCandless added a comment -

        Patch, extracting text from annotations; I added an option to PDFParser to turn this on/off, and I re-enabled the test case and it now passes.
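
        For illustration only (this is not taken from the attached patch): a minimal, library-free sketch of the behaviour described above, where annotation text is emitted after a page's body text only when an option is switched on. The Annotation and Page classes and the extractAnnotationText flag are hypothetical stand-ins, not the PDFBox or Tika types, and the real option name in PDFParser is not stated in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the PDFBox or Tika types.
class AnnotationTextSketch {

    /** A page annotation; any field may be null. */
    static final class Annotation {
        final String title;
        final String subject;
        final String contents;

        Annotation(String title, String subject, String contents) {
            this.title = title;
            this.subject = subject;
            this.contents = contents;
        }
    }

    /** A page: its body text plus the annotations attached to it. */
    static final class Page {
        final String bodyText;
        final List<Annotation> annotations;

        Page(String bodyText, List<Annotation> annotations) {
            this.bodyText = bodyText;
            this.annotations = annotations;
        }
    }

    /**
     * Returns one entry per paragraph: the page body first, then the
     * non-blank title, subject, and contents of each annotation, but only
     * when extractAnnotationText is true.
     */
    static List<String> extract(Page page, boolean extractAnnotationText) {
        List<String> paragraphs = new ArrayList<>();
        paragraphs.add(page.bodyText);
        if (!extractAnnotationText || page.annotations == null) {
            return paragraphs;
        }
        for (Annotation a : page.annotations) {
            for (String field : new String[] {a.title, a.subject, a.contents}) {
                // Skip missing or blank fields rather than emitting empty paragraphs.
                if (field != null && !field.trim().isEmpty()) {
                    paragraphs.add(field.trim());
                }
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        Page page = new Page("Body text.",
                Arrays.asList(new Annotation("Reviewer", null, "Please check this figure.")));
        System.out.println(extract(page, true));   // body text plus annotation fields
        System.out.println(extract(page, false));  // body text only
    }
}
```

        With the flag off, the output matches what a parser that ignores annotations would produce, so in this sketch the option only ever adds text.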

        Michael McCandless added a comment -

        I opened PDFBOX-1143 to improve PDFTextStripper so that it visits text annotations.

        I also worked out a simple patch to PDF2XHTML to directly extract the annotations ourselves until PDFBOX-1143 is fixed.

        Michael McCandless added a comment -

        I moved the failing (but ignored) test case into PDFParserTest.

        Browsing through PDFBox's sources, it seems to have a lot of code for handling annotations, so hopefully it's just a matter of Tika tapping into this...


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 1
          • Watchers: 0
