Tika / TIKA-1124

Nested documents not extracted if a PDF file is in the chain

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.5
    • Component/s: general
    • Labels:
      None

      Description

      Tika 1.3 cannot extract the attachments embedded in the attached PDF.
      Trunk can extract attachments from the PDF. However, if that PDF is itself embedded in another document, the documents embedded in the PDF are not extracted.

      I'm not sure of a solution, but I found two things that might help with the diagnosis:
      1) If you modify the code in PDFParser so that it doesn't wrap the handler in a BodyContentHandler, everything works (in trunk).
      2) If you modify BodyContentHandler to use my toy SimpleBodyMatchingContentHandler, the problem is also solved.

      The cause may be in the MatchingContentHandler.
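
For readers unfamiliar with body-matching handlers, here is a minimal sketch of the general idea: forward only the SAX events that occur inside a body element, and count open body elements so that a nested html/body pair (such as an embedded document might emit) does not switch forwarding off early. This is an illustration under stated assumptions, not the SimpleBodyMatchingContentHandler from the attached patch and not Tika's MatchingContentHandler. The class name BodyOnlyHandler and the helper extractBodyText are hypothetical, and the sketch uses only the JDK's SAX classes rather than any Tika API.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the handler from TIKA-1124.patch.
class BodyOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Number of currently open <body> elements. Using a counter instead of a
    // boolean keeps text collection active across nested <html><body> pairs.
    private int bodyDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep character data only while inside at least one <body>.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    // Parses the given XML string (non-namespace-aware) and returns the text
    // found inside <body> elements.
    static String extractBodyText(String xml) throws Exception {
        BodyOnlyHandler handler = new BodyOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }
}
```

With an input in which one html/body pair is nested inside another, this sketch keeps the text of both bodies and drops the title text from the head. Whether a depth-counting approach like this matches the actual fix is not something this issue's description confirms.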

      1. TIKA-1124.patch
        12 kB
        Tim Allison
      2. pdf_attachment_issues.zip
        44 kB
        Tim Allison


          People

          • Assignee: Unassigned
          • Reporter: Tim Allison
          • Votes: 0
          • Watchers: 2
