[PDFBOX-1297] ExtractText fails to extract text from packaged PDFs - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.7.0
Component/s: Text extraction
Labels: None
Environment: Fedora 13 Linux

Description

Apparently a PDF can contain multiple files, much like a Zip archive; this is called
a PDF Package and is described at
http://help.adobe.com/en_US/Reader/8.0/help.html?content=WSE034CA46-D08F-4fff-AA3C-FF04510DAEF0.html

I have a simple example PDF Package, containing two sub-PDFs, but ExtractText
fails to extract their text.

ExtractText runs to completion without exceptions, but the only text it extracts is the
boilerplate message saying you should upgrade to Adobe Acrobat version 8 or later to view this PDF.
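
For context, as far as I understand the format, the sub-PDFs in a package are stored as embedded file attachments in a name tree, not as pages, which would explain why a page-based extractor only sees the boilerplate text. Below is a minimal sketch of the traversal an extractor would need. The `Node` type, its fields, and the sample data are hypothetical stand-ins for illustration only; they are not PDFBox classes and this is not the attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PackageTextSketch {

    /** Hypothetical stand-in for one node of the embedded-files name tree. */
    static final class Node {
        // Attachment name -> text of that attachment (assumed already extracted).
        final Map<String, String> files = new LinkedHashMap<>();
        // Child nodes; attachments may live here instead of on the root.
        final List<Node> kids = new ArrayList<>();
    }

    /** Collects the text of every PDF attachment under the node, depth first. */
    static List<String> collectText(Node node) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : node.files.entrySet()) {
            // Only attachments that are themselves PDFs are considered here.
            if (e.getKey().toLowerCase().endsWith(".pdf")) {
                out.add(e.getValue());
            }
        }
        for (Node kid : node.kids) {
            out.addAll(collectText(kid));
        }
        return out;
    }

    /** Builds a toy package: an empty root whose single kid holds two sub-PDFs. */
    static Node sample() {
        Node root = new Node();
        Node kid = new Node();
        kid.files.put("first.pdf", "text of first sub-PDF");
        kid.files.put("second.pdf", "text of second sub-PDF");
        kid.files.put("notes.txt", "not a PDF, skipped");
        root.kids.add(kid);
        return root;
    }

    public static void main(String[] args) {
        for (String text : collectText(sample())) {
            System.out.println(text);
        }
    }
}
```

The point of the sketch is that looking only at the root (or only at the pages) finds nothing useful; the attachments have to be reached by walking the tree and extracting each sub-PDF in turn.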

Attachments

PDFPackage.pdf (90 kB, 26/Apr/12 16:57, Michael McCandless)
PDFBOX-1297.patch (7 kB, 04/May/12 10:39, Michael McCandless)
testPDFPackage.pdf (90 kB, 04/May/12 10:39, Michael McCandless)

People

Assignee: Andreas Lehmkühler
Reporter: Michael McCandless
Votes: 0
Watchers: 1

Dates

Created: 26/Apr/12 16:55
Updated: 29/May/12 16:21
Resolved: 16/May/12 16:08