Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
3.0.2 PDFBox
Description
As reported by Erik Branks on the mailing list:
When attempting text extraction from the PDF at https://d-nb.info/1324982411/34, using either PDFBox 3.0.0 or PDFBox 4.0.0-SNAPSHOT, the extraction uses about 1.8 GB of heap memory and does not seem to terminate. I cancelled the extraction attempt after roughly 20 minutes. Is this another bad PDF or is there a bug in PDFBox?
This happens with pages 230 and 231 (and possibly others). Both have thousands of content streams in the content stream array. The profiler suggests that most of the time is spent in SequenceRandomAccessRead.seek().
Rendering page 230 with PDFBox 2.0: 50 seconds
Rendering page 230 with PDFBox trunk: 2990 seconds
Rendering page 231 with PDFBox trunk: 4798 seconds
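The seek() hotspot would be consistent with a lookup whose cost grows with the number of content streams, although nothing in this report confirms that this is the cause. The following is a minimal sketch in plain Java, not PDFBox code: the class SeekCostSketch and its methods cumulativeEnds, locateLinear and locateBinary are hypothetical names chosen for illustration. It models a page's content streams as concatenated byte ranges and compares two ways of finding the sub-stream that contains a given position.

```java
import java.util.Arrays;

// Hypothetical model, NOT the PDFBox implementation: a sequence of
// sub-streams is represented only by their cumulative end offsets.
public class SeekCostSketch {

    // ends[i] is the absolute offset one past the last byte of sub-stream i.
    static long[] cumulativeEnds(int[] lengths) {
        long[] ends = new long[lengths.length];
        long total = 0;
        for (int i = 0; i < lengths.length; i++) {
            total += lengths[i];
            ends[i] = total;
        }
        return ends;
    }

    // Walks the sub-streams from the start: O(n) per seek, so many seeks
    // over thousands of sub-streams can add up to quadratic work.
    static int locateLinear(long[] ends, long position) {
        if (position < 0) {
            return -1;
        }
        for (int i = 0; i < ends.length; i++) {
            if (position < ends[i]) {
                return i;
            }
        }
        return -1; // position is at or past the end of the sequence
    }

    // Binary search for the first index whose end offset exceeds the
    // position: O(log n) per seek, and safe with zero-length sub-streams.
    static int locateBinary(long[] ends, long position) {
        if (position < 0) {
            return -1;
        }
        int lo = 0;
        int hi = ends.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (ends[mid] <= position) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < ends.length ? lo : -1;
    }

    public static void main(String[] args) {
        // Assumed workload: 5000 sub-streams of 100 bytes each.
        int[] lengths = new int[5000];
        Arrays.fill(lengths, 100);
        long[] ends = cumulativeEnds(lengths);
        long lastByte = ends[ends.length - 1] - 1;
        System.out.println("linear: " + locateLinear(ends, lastByte));
        System.out.println("binary: " + locateBinary(ends, lastByte));
    }
}
```

Both lookups return the same index. The difference is only in how many comparisons each seek needs, which matters when a page has thousands of content streams and the parser seeks often.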