Tika / TIKA-1124

Nested documents not extracted if a PDF file is in the chain

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.5
    • Component/s: general
    • Labels:
      None

      Description

      Tika 1.3 cannot extract the attachments from the attached PDF.
      Trunk can extract attachments from the PDF. However, if that PDF is itself embedded in another document, the documents embedded in the PDF are not extracted.

      I'm not sure of a solution, but I found two things that might help with the diagnosis:
      1) If you modify the code in PDFParser so that it doesn't wrap the handler in a BodyContentHandler, everything works (in trunk).
      2) If you modify BodyContentHandler to use my toy SimpleBodyMatchingContentHandler, the problem is also solved.

      The cause may be in the MatchingContentHandler.

      1. TIKA-1124.patch
        12 kB
        Tim Allison
      2. pdf_attachment_issues.zip
        44 kB
        Tim Allison

        Activity

        Tim Allison added a comment -

        Added tests, following Nick's advice to model them on POIContainerExtractionTest. Committed in r1511901 and r1511908.

        Nick Burch added a comment -

        I don't know the PDF code well, but at first glance the patch looks good.

        It might be worth expanding the test a little to check that all the correct parts were found, and in the right order. POIContainerExtractionTest has some examples of doing that for other embedded resources, so it could serve as a guide.
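
A minimal sketch of the kind of check suggested above, assuming a recorder that notes each embedded resource as it is reported. The `Recorder` class, `simulateExtraction`, and the name `inner.txt` are hypothetical stand-ins for illustration only; they are not Tika classes, and the real tests in POIContainerExtractionTest may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmbeddedOrderCheckSketch {

    /** Hypothetical recorder: stores the name and media type of each embedded resource, in arrival order. */
    static class Recorder {
        final List<String> names = new ArrayList<>();
        final List<String> mediaTypes = new ArrayList<>();

        void onEmbedded(String name, String mediaType) {
            names.add(name);
            mediaTypes.add(mediaType);
        }
    }

    /** Stand-in for running an extractor over outer.docx; a real test would drive a parser instead. */
    static Recorder simulateExtraction() {
        Recorder recorder = new Recorder();
        recorder.onEmbedded("attached.pdf", "application/pdf");
        recorder.onEmbedded("inner.txt", "text/plain"); // hypothetical name for the PDF's attachment
        return recorder;
    }

    /** Fails if the recorded names differ from the expected names, including their order. */
    static void assertFoundInOrder(Recorder recorder, String... expected) {
        List<String> want = Arrays.asList(expected);
        if (!recorder.names.equals(want)) {
            throw new AssertionError("expected " + want + " but found " + recorder.names);
        }
    }

    public static void main(String[] args) {
        Recorder recorder = simulateExtraction();
        assertFoundInOrder(recorder, "attached.pdf", "inner.txt");
        System.out.println("found: " + recorder.names);
    }
}
```

Comparing whole lists checks both completeness and order in one step, which is the point of the suggestion: a count-only assertion would pass even if the nested attachment were reported before its parent.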

        Tim Allison added a comment -

        I chose to move the embedded-file code into PDF2XHTML. This lets PDF2XHTML's XHTMLContentHandler close </body> properly. I will strip the Windows noise before committing, but I wanted to submit this draft in case anyone wants to review it.

        Tim Allison added a comment -

        OK, I think I figured this out. AbstractOOXML includes the contents of embedded documents before calling handler.endDocument().
        PDFParser, however, calls handler.endDocument() and then tries to append the content of embedded documents.
        I think this means that the parent handler sees the end of the body and therefore does not process the contents of the embedded document.

        Trivial fix: move handler.endDocument() out of PDF2XHTML and call it in PDFParser after the embedded documents have been processed.

        Unless I hear otherwise, I'll commit this over the next few days.
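
The ordering problem described above can be illustrated with a toy SAX handler. This is only a sketch under stated assumptions: `BodyOnlyHandler` is a hypothetical stand-in for a handler that keeps text seen inside the body element, and `parse` imitates the two call orders. It uses the JDK's SAX classes only and is not the actual PDFParser or MatchingContentHandler code.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class EndDocumentOrderSketch {

    /** Hypothetical body-only handler: keeps character data only while inside <body>. */
    static class BodyOnlyHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(localName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(localName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    static void emit(DefaultHandler handler, String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    /** Emits embedded content either before the body is closed or after endDocument(). */
    static String parse(boolean embeddedBeforeClose) throws SAXException {
        BodyOnlyHandler handler = new BodyOnlyHandler();
        handler.startDocument();
        handler.startElement("", "body", "body", new AttributesImpl());
        emit(handler, "pdf text;");
        if (embeddedBeforeClose) {
            emit(handler, "embedded text;"); // order attributed above to AbstractOOXML
        }
        handler.endElement("", "body", "body");
        handler.endDocument();
        if (!embeddedBeforeClose) {
            emit(handler, "embedded text;"); // order attributed above to PDFParser before the fix
        }
        return handler.text();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(parse(false)); // prints "pdf text;"
        System.out.println(parse(true));  // prints "pdf text;embedded text;"
    }
}
```

In this toy model the embedded text is silently dropped when it arrives after the body has closed, which matches the symptom reported in this issue. Whether the real handler chain fails for exactly this reason is the hypothesis stated above, not something the sketch proves.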

        Tim Allison added a comment -

        If anyone has a chance to look into this, I'd appreciate it. I suspect something is going awry in how the recursion in the triggering documents interacts with the XPath query in MatchingContentHandler. Thank you!

        Tim Allison added a comment -

        outer.docx contains attached.pdf, which itself contains an attachment. Toy examples that avoid the use of MatchingContentHandler are also attached.


          People

          • Assignee:
            Unassigned
            Reporter:
            Tim Allison
          • Votes:
            0
            Watchers:
            2
