PDFBox / PDFBOX-1297

ExtractText fails to extract text from packaged PDFs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Fedora 13 Linux

      Description

      Apparently a PDF is able to contain multiple files (like a Zip file); it's called
      a PDF Package, described at
      http://help.adobe.com/en_US/Reader/8.0/help.html?content=WSE034CA46-D08F-4fff-AA3C-FF04510DAEF0.html

      I have a simple example PDF Package, containing two sub-PDFs, but ExtractText
      fails to extract their text.

      It does run successfully (no exceptions), but the text it extracts is just the boilerplate text
      saying you should upgrade to Adobe Acrobat version 8 or later to view this PDF.

      1. testPDFPackage.pdf
        90 kB
        Michael McCandless
      2. PDFPackage.pdf
        90 kB
        Michael McCandless
      3. PDFBOX-1297.patch
        7 kB
        Michael McCandless

        Activity

        Andreas Lehmkühler added a comment - edited

        I applied the patch in revision 1339245 as proposed, with one exception: I removed the else branch that handled other formats, since only PDFs are relevant here.

        Thanks for the contribution!

        Michael McCandless added a comment -

        Patch, adding embedded PDF handling to ExtractText, plus a test case
        (and test document).

        I would really appreciate someone who's more familiar with PDFBox's
        APIs having a look at what I did... I had to dig into various classes
        that I don't really understand: PDDocumentCatalog,
        PDDocumentNameDictionary, PDEmbeddedFilesNameTreeNode,
        PDComplexFileSpecification, PDEmbeddedFile.

        I extract text only from embedded PDFs, not from other content types.

        I noticed Tika's parser also fails to visit embedded documents within
        a PDF... I'll open a separate issue.
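
The embedded-PDF handling described in this comment can be illustrated with a minimal sketch. It is not the attached patch and does not call PDFBox: `Doc`, `Attachment`, and `extract` are hypothetical stand-ins that only model a package with embedded files. The sketch assumes embedded PDFs are identified by the MIME type `application/pdf`, and it visits a single level of embedding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT PDFBox classes: they only model the shape of a PDF package.
final class EmbeddedTextSketch {

    /** A document with its own page text plus any embedded files. */
    static final class Doc {
        final String text;
        final List<Attachment> attachments = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }

        /** Adds an embedded file and returns this document for chaining. */
        Doc attach(String name, String mimeType, Doc content) {
            attachments.add(new Attachment(name, mimeType, content));
            return this;
        }
    }

    /** An embedded file; content is null when the file is not a PDF. */
    static final class Attachment {
        final String name;
        final String mimeType;
        final Doc content;

        Attachment(String name, String mimeType, Doc content) {
            this.name = name;
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    /** Returns the outer document's text followed by the text of each embedded PDF. */
    static String extract(Doc doc) {
        StringBuilder out = new StringBuilder();
        out.append(doc.text).append('\n');
        for (Attachment a : doc.attachments) {
            // Assumption: only attachments typed as PDF are extracted; others are skipped.
            if ("application/pdf".equals(a.mimeType) && a.content != null) {
                out.append(a.content.text).append('\n');
            }
        }
        return out.toString();
    }
}
```

For a package shaped like the sample attached to this issue, the outer text would be the viewer-upgrade boilerplate and the two embedded documents would contribute "PDF1" and "PDF2". An attachment with any other content type contributes nothing.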

        Michael McCandless added a comment -

        Example document showing that ExtractText doesn't extract the text from the embedded documents.

        The document was created with Adobe Acrobat.

        It renders fine with Acrobat Reader: I see the two PDFs and when I click on each I can see their text "PDF1" and "PDF2".


          People

          • Assignee:
            Andreas Lehmkühler
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            1
