[PDFBOX-1093] Copy Page from one Document to another: Page Content Stream Linked to Original Document - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.8.2
Component/s: Text extraction
Labels:
None

Description

When a page is grabbed from one document and added to another (via addPage or importPage) the Content Stream of the page retains the scratchFile and unFiltered/FilteredStreams from it's original document. This means that a page is always connected to it's original document and not wholly a part of it's new document.

The problem with this situation:
-When searching for text within a large (800,000 page) pdf file performance can potentially be increased if the pdf file is split into single pages for incremental text extraction. Each page is searched individually rather than an entire document search. To achieve this, a new document is created and a single page from the original pdf is added.

-When searching through these 1 page documents, the scratchFile of the original pdf is used, and it will grow as the text from each page is extracted. This leads to an out of memory condition, which appears as a "SEVERE Stop reading corrupt stream" exception from doDecode() as the write buffer attempts to expand to a size greater than the maximum heap size.

A workaround for this problem is to create a new document, add the page to the document, save the document, close it and then load it again. Unfortunately the performance cost of this workaround is prohibitive.

Attachments

Issue Links

depends upon

PDFBOX-1586 IndexOutOfBoundsException when saving a document (at random)

Closed

Activity

People

Assignee:: Andreas Lehmkühler

Reporter:: eddie.greene@gmail.com

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 12/Aug/11 20:01

Updated:: 02/Jun/13 13:35

Resolved:: 05/May/13 13:09