  PDFBox / PDFBOX-1297

ExtractText fails to extract text from packaged PDFs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Fedora 13 Linux

      Description

      Apparently a PDF can contain multiple embedded files (much like a Zip file);
      this is called a PDF Package, described at
      http://help.adobe.com/en_US/Reader/8.0/help.html?content=WSE034CA46-D08F-4fff-AA3C-FF04510DAEF0.html

      I have a simple example PDF Package, containing two sub-PDFs, but ExtractText
      fails to extract their text.

      ExtractText runs successfully (no exceptions), but the only text it extracts is the
      boilerplate message saying you should upgrade to Adobe Acrobat version 8 or later
      to view this PDF.
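
      For anyone triaging a similar file, here is a rough sketch (not the attached patch,
      and not PDFBox API) of a stdlib-only check for whether a PDF is likely a package.
      It scans the raw bytes for the /Collection and /EmbeddedFiles names that the PDF
      format uses for portable collections and embedded files. The class and method
      names are hypothetical, and the scan is only a heuristic: it will miss files whose
      catalog sits inside a compressed object stream, and it can match unrelated names
      that merely start with the same text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of PDFBox.
public class PackageSniffer {

    /**
     * Returns true if the raw bytes mention /Collection or /EmbeddedFiles.
     * This is a crude heuristic, not a real PDF parse.
     */
    public static boolean looksLikePackage(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data
        // can be searched as a string without decoding errors.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Collection") || raw.contains("/EmbeddedFiles");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PackageSniffer <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksLikePackage(bytes)
                ? "possible PDF Package (embedded files present)"
                : "no package markers found in uncompressed data");
    }
}
```

      If the check reports a possible package, the text of interest probably lives in the
      embedded sub-PDFs rather than in the visible cover page, which would explain why
      only the upgrade boilerplate comes out.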

        Attachments

        1. testPDFPackage.pdf
          90 kB
          Michael McCandless
        2. PDFBOX-1297.patch
          7 kB
          Michael McCandless
        3. PDFPackage.pdf
          90 kB
          Michael McCandless

            People

            • Assignee: Andreas Lehmkühler (lehmi)
            • Reporter: Michael McCandless (mikemccand)