[PDFBOX-5799] Page with thousands of content streams takes extremely long to render or extract - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.2 PDFBox
Fix Version/s: 3.0.3 PDFBox, 4.0.0
Component/s: Rendering, Text extraction
Labels:
- performance

Description

As reported by Erik Branks on the mailing list:

When attempting text extraction from the PDF at https://d-nb.info/1324982411/34, using either PDFBox 3.0.0 or PDFBox 4.0.0-SNAPSHOT, the extraction uses about 1.8 GB of heap memory and does not seem to terminate. I cancelled the extraction attempt after roughly 20 minutes. Is this another bad PDF or is there a bug in PDFBox?

This happens with pages 230 and 231, and possibly others. Both pages have thousands of content streams in their content stream array. The profiler suggests that most of the time is spent in SequenceRandomAccessRead.seek().

Rendering page 230 with PDFBox 2.0: 50 seconds

Rendering page 230 with PDFBox trunk: 2990 seconds

Rendering page 231 with PDFBox trunk: 4798 seconds
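
To illustrate why seeking across thousands of concatenated streams can dominate the runtime, here is a minimal, self-contained sketch. It is not PDFBox code and does not claim to reflect either the implementation of SequenceRandomAccessRead or the fix that went into 3.0.3. The class name SegmentSeekSketch and its methods are hypothetical. It only contrasts two ways of finding which segment holds a given absolute position: a linear scan, whose cost grows with the number of segments on every seek, and a binary search over precomputed start offsets.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration (not PDFBox code): locate the segment that
 * contains an absolute position in a sequence of concatenated segments.
 */
public class SegmentSeekSketch {
    // startOffsets[i] is the absolute offset at which segment i begins.
    private final long[] startOffsets;
    private final long totalLength;

    public SegmentSeekSketch(long[] segmentLengths) {
        startOffsets = new long[segmentLengths.length];
        long offset = 0;
        for (int i = 0; i < segmentLengths.length; i++) {
            // Assumption for this sketch: every segment is non-empty,
            // so the start offsets are strictly ascending.
            if (segmentLengths[i] <= 0) {
                throw new IllegalArgumentException("segment length must be positive");
            }
            startOffsets[i] = offset;
            offset += segmentLengths[i];
        }
        totalLength = offset;
    }

    /** Linear scan: each lookup costs O(n) for n segments. */
    public int findSegmentLinear(long position) {
        checkPosition(position);
        int index = 0;
        while (index + 1 < startOffsets.length && startOffsets[index + 1] <= position) {
            index++;
        }
        return index;
    }

    /** Binary search over the start offsets: each lookup costs O(log n). */
    public int findSegmentBinary(long position) {
        checkPosition(position);
        int found = Arrays.binarySearch(startOffsets, position);
        // A negative result encodes -(insertionPoint) - 1; the containing
        // segment is the one just before the insertion point.
        return found >= 0 ? found : -found - 2;
    }

    private void checkPosition(long position) {
        if (position < 0 || position >= totalLength) {
            throw new IndexOutOfBoundsException("position out of range: " + position);
        }
    }

    public static void main(String[] args) {
        // 5000 segments of 100 bytes each, standing in for a page whose
        // content stream array holds thousands of streams.
        long[] lengths = new long[5000];
        Arrays.fill(lengths, 100L);
        SegmentSeekSketch sketch = new SegmentSeekSketch(lengths);

        long position = 123_456L;
        System.out.println("linear: " + sketch.findSegmentLinear(position));
        System.out.println("binary: " + sketch.findSegmentBinary(position));
    }
}
```

Both methods return the same segment index. The difference is only in cost: if a parser seeks very frequently, a per-seek cost proportional to the number of segments adds up quickly on pages like the ones reported here.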

People

Assignee: Andreas Lehmkühler

Reporter: Tilman Hausherr

Votes: 0

Watchers: 3

Dates

Created: 03/Apr/24 17:31

Updated: 10/Aug/24 07:18

Resolved: 06/Apr/24 08:21