[PDFBOX-695] COSStream doesn't actually stream tokens, causing OOM in larger PDF text extraction - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.2.0
Fix Version/s: 1.2.0
Component/s: Text extraction
Labels:
- OOM
- bug
Environment:
All

Description

Text extraction from certain PDFs hangs or runs out of memory (OOM). Profiling showed that PDFStreamEngine.processSubStream() eventually calls PDFStreamParser.getTokens(), which builds an ArrayList of every token in the stream. In some cases this list uses over 1 GB of memory.

The attached patch replaces PDFStreamParser.getTokens() with PDFStreamParser.getTokensIterator(), which streams the tokens and avoids building the ArrayList. The iterator is used only in the call path of org.apache.pdfbox.ExtractText, so other usages may not benefit from the fix. The API introduced by the fix may also not be ideal.
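The difference between the two approaches can be sketched in plain Java. This is not the code from the attached patch and does not use PDFBox: the class TokenStreamingSketch, its methods readAllTokens() and tokenIterator(), and the whitespace-delimited "tokens" are all hypothetical stand-ins chosen for illustration. The eager method mirrors the list-building behavior described above; the lazy one parses a token only when the caller asks for it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; not PDFBox code and not the attached patch.
class TokenStreamingSketch {

    /** Eager variant: every token is held in memory before the caller sees the first one. */
    static List<String> readAllTokens(Reader in) {
        List<String> tokens = new ArrayList<>();
        String token;
        while ((token = nextToken(in)) != null) {
            tokens.add(token);
        }
        return tokens;
    }

    /** Lazy variant: parses on demand and holds at most one look-ahead token. */
    static Iterator<String> tokenIterator(final Reader in) {
        return new Iterator<String>() {
            private String lookahead;   // token parsed by hasNext() but not yet returned
            private boolean exhausted;  // true once the reader has hit end of input

            @Override
            public boolean hasNext() {
                if (lookahead == null && !exhausted) {
                    lookahead = nextToken(in);
                    exhausted = (lookahead == null);
                }
                return lookahead != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String token = lookahead;
                lookahead = null;
                return token;
            }
        };
    }

    /** Reads the next whitespace-delimited token, or returns null at end of input. */
    private static String nextToken(Reader in) {
        try {
            int c = in.read();
            while (c != -1 && Character.isWhitespace(c)) {
                c = in.read();
            }
            if (c == -1) {
                return null;
            }
            StringBuilder sb = new StringBuilder();
            while (c != -1 && !Character.isWhitespace(c)) {
                sb.append((char) c);
                c = in.read();
            }
            return sb.toString();
        } catch (IOException e) {
            // Iterator methods cannot throw checked exceptions, so wrap it.
            throw new UncheckedIOException(e);
        }
    }
}
```

A consumer that handles one token at a time can loop over tokenIterator(reader) with hasNext() and next(), so memory use in this sketch stays bounded by the largest single token rather than by the total token count. The trade-off is that callers needing random access or multiple passes over the tokens still have to collect them into a list.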

Attachments

pdfbox-oom-against-935604.patch
19/Apr/10 23:23
5 kB
Kyle Maxwell

People

Assignee: Unassigned

Reporter: Kyle Maxwell

Votes: 0

Watchers: 2

Dates

Created: 19/Apr/10 23:22

Updated: 01/Jul/10 07:27

Resolved: 06/May/10 17:48