Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
1.2.0
-
All
Description
Text extraction of certain pdfs has been hanging and/or OOMing. Profiling revealed that PDFStreamEngine.processSubStream() eventually calls PDFStreamParser.getTokens(), which assembles an ArrayList of Tokens. In some cases, this can use over 1GB of memory.
The attached patch replaces PDFStreamParser.getTokens() with PDFStreamParser.getTokensIterator(), which streams the tokens, avoiding the ArrayList build. It only uses this in the call path of org.apache.pdfbox.ExtractText, so the fix may not benefit other usages. Also, API used by the fix may not be ideal.