[PDFBOX-3295] Improve parsing performance of object streams - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.11, 2.0.0, 3.0.0 PDFBox
Fix Version/s: 1.8.12, 2.0.1, 3.0.0 PDFBox
Component/s: Parsing
Labels: None

Description

About a year ago, torakiki posted the following comment about xref refactoring on the dev list:

few days ago I was profiling PDFBox when loading medium/large size
documents and I think I found something.
If you try loading the document
http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
it takes quite some time and that's mostly spent in the
XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
an object contained in an unparsed object stream is found, the
XrefTrailerResolver performs a full scan of the xref entries found in the
document, in this case hundreds of thousands. If the object streams are
many (like in the given doc), it performs many full scans resulting in poor
performance.
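
As a rough illustration of the problem quoted above (not the actual PDFBox code or the committed fix), the sketch below contrasts a per-lookup full scan with a reverse index built in a single pass. The class name `ObjectStreamIndex`, both method names, and the simplified xref model (a map from object number to the number of the containing object stream) are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model: xref maps an object number to the number
// of the object stream that contains it. Objects stored outside of object
// streams are assumed to be absent from the map.
final class ObjectStreamIndex {

    // Slow variant: every call walks all xref entries, so resolving
    // S object streams over N entries costs roughly S * N steps.
    static Set<Integer> containedByScan(Map<Integer, Integer> xref, int streamNumber) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, Integer> entry : xref.entrySet()) {
            if (entry.getValue() == streamNumber) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Faster variant: one pass over the xref entries builds a map from
    // object stream number to the object numbers it contains; later
    // lookups are then a single hash lookup each.
    static Map<Integer, Set<Integer>> buildIndex(Map<Integer, Integer> xref) {
        Map<Integer, Set<Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, Integer> entry : xref.entrySet()) {
            index.computeIfAbsent(entry.getValue(), k -> new TreeSet<>())
                 .add(entry.getKey());
        }
        return index;
    }
}
```

With such an index, the cost of resolving all object streams drops from one scan per stream to one scan in total, which is the kind of saving the profile above points to for documents with hundreds of thousands of xref entries.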

People

Assignee: Andreas Lehmkühler

Reporter: Andreas Lehmkühler

Votes: 0

Watchers: 4

Dates

Created: 29/Mar/16 20:32

Updated: 25/Mar/17 18:13

Resolved: 30/Mar/16 17:08