PDFBOX-1297

ExtractText fails to extract text from packaged PDFs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Fedora 13 Linux

      Description

      Apparently a PDF is able to contain multiple files (like a Zip file); it's called
      a PDF Package, described at
      http://help.adobe.com/en_US/Reader/8.0/help.html?content=WSE034CA46-D08F-4fff-AA3C-FF04510DAEF0.html

      I have a simple example PDF Package, containing two sub-PDFs, but ExtractText
      fails to extract their text.

      It runs successfully (no exceptions), but the only text it extracts is the boilerplate
      message telling you to upgrade to Adobe Acrobat version 8 or later to view this PDF.
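
      The behavior described above suggests the extractor reads only the cover document and
      never descends into the embedded files. A minimal sketch of the traversal that would be
      needed, assuming a hypothetical `Doc` stand-in for a parsed PDF (these are not PDFBox
      classes, and this is not the attached patch):

```java
import java.util.ArrayList;
import java.util.List;

class PackageTextSketch {

    /** Hypothetical stand-in for a parsed PDF: its own page text plus any embedded documents. */
    static final class Doc {
        final String name;
        final String pageText;
        final List<Doc> embedded = new ArrayList<>();

        Doc(String name, String pageText) {
            this.name = name;
            this.pageText = pageText;
        }

        /** Adds an embedded document and returns this Doc so calls can be chained. */
        Doc embed(Doc child) {
            embedded.add(child);
            return this;
        }
    }

    /** Appends the document's own text, then recurses into each embedded document in order. */
    static void extract(Doc doc, StringBuilder out) {
        out.append(doc.pageText).append('\n');
        for (Doc child : doc.embedded) {
            extract(child, out);
        }
    }

    /** Returns the text of the root document followed by the text of everything embedded in it. */
    static String extractAll(Doc root) {
        StringBuilder out = new StringBuilder();
        extract(root, out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Models a package with a boilerplate cover page and two sub-PDFs.
        Doc pkg = new Doc("package.pdf", "Please upgrade your viewer.")
                .embed(new Doc("a.pdf", "Text of first sub-PDF"))
                .embed(new Doc("b.pdf", "Text of second sub-PDF"));
        System.out.print(extractAll(pkg));
    }
}
```

      In this model, extracting only `pageText` of the root reproduces the reported symptom
      (boilerplate only), while the recursive walk also returns the text of both sub-PDFs.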

      1. testPDFPackage.pdf
        90 kB
        Michael McCandless
      2. PDFBOX-1297.patch
        7 kB
        Michael McCandless
      3. PDFPackage.pdf
        90 kB
        Michael McCandless

          People

          • Assignee:
            Andreas Lehmkühler
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            1
