PDFBox / PDFBOX-1297

ExtractText fails to extract text from packaged PDFs

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Text extraction
    • Labels: None
    • Environment: Fedora 13 Linux

    Description

      Apparently a PDF can contain multiple files, much like a Zip archive. Such a file is called
      a PDF Package, described at
      http://help.adobe.com/en_US/Reader/8.0/help.html?content=WSE034CA46-D08F-4fff-AA3C-FF04510DAEF0.html

      I have a simple example PDF Package, containing two sub-PDFs, but ExtractText
      fails to extract their text.

      ExtractText runs successfully (no exceptions), but the only text it extracts is the boilerplate
      telling you to upgrade to Adobe Acrobat version 8 or later to view this PDF.
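
      To illustrate the behaviour described above, here is a minimal sketch in plain Java.
      It does not use the PDFBox API: PackageDoc, extractPagesOnly and extractWithEmbedded
      are hypothetical names, and the model (a cover document with its own page text plus a
      list of embedded sub-documents) is an assumption made for illustration only. It shows
      why reading only the container's pages yields just the boilerplate, and how descending
      into the embedded files would also reach the sub-PDFs' text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of a PDF Package; NOT the PDFBox API.
final class PackageDoc {
    final String name;
    final String pageText;            // text found on the document's own pages
    final List<PackageDoc> embedded;  // sub-documents carried inside the package

    PackageDoc(String name, String pageText, PackageDoc... embedded) {
        this.name = name;
        this.pageText = pageText;
        this.embedded = new ArrayList<>(Arrays.asList(embedded));
    }

    // Behaviour reported in this issue: only the container's own pages are read,
    // so the embedded documents are never visited.
    static String extractPagesOnly(PackageDoc doc) {
        return doc.pageText;
    }

    // Sketch of the desired behaviour: emit the container's text, then recurse
    // into each embedded document in order, one per line.
    static String extractWithEmbedded(PackageDoc doc) {
        StringBuilder out = new StringBuilder(doc.pageText);
        for (PackageDoc child : doc.embedded) {
            out.append('\n').append(extractWithEmbedded(child));
        }
        return out.toString();
    }
}
```

      With a package whose cover page holds the upgrade notice and which embeds two
      sub-documents, extractPagesOnly returns only the notice, while extractWithEmbedded
      returns the notice followed by the text of both sub-documents.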

      Attachments

        1. PDFPackage.pdf
          90 kB
          Michael McCandless
        2. PDFBOX-1297.patch
          7 kB
          Michael McCandless
        3. testPDFPackage.pdf
          90 kB
          Michael McCandless

        Activity

          People

            Assignee: lehmi Andreas Lehmkühler
            Reporter: mikemccand Michael McCandless
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: