Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
1.6.0
-
None
Description
When a page is grabbed from one document and added to another (via addPage or importPage) the Content Stream of the page retains the scratchFile and unFiltered/FilteredStreams from it's original document. This means that a page is always connected to it's original document and not wholly a part of it's new document.
The problem with this situation:
-When searching for text within a large (800,000 page) pdf file performance can potentially be increased if the pdf file is split into single pages for incremental text extraction. Each page is searched individually rather than an entire document search. To achieve this, a new document is created and a single page from the original pdf is added.
-When searching through these 1 page documents, the scratchFile of the original pdf is used, and it will grow as the text from each page is extracted. This leads to an out of memory condition, which appears as a "SEVERE Stop reading corrupt stream" exception from doDecode() as the write buffer attempts to expand to a size greater than the maximum heap size.
A workaround for this problem is to create a new document, add the page to the document, save the document, close it and then load it again. Unfortunately the performance cost of this workaround is prohibitive.
Attachments
Issue Links
- depends upon
-
PDFBOX-1586 IndexOutOfBoundsException when saving a document (at random)
- Closed