PDFBox / PDFBOX-3295

Improve parsing performance of object streams

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.11, 2.0.0, 3.0.0 PDFBox
    • Fix Version/s: 1.8.12, 2.0.1, 3.0.0 PDFBox
    • Component/s: Parsing
    • Labels:
      None

      Description

      About a year ago, Andrea Vacondio posted the following comment about xref refactoring on the dev list:

      A few days ago I was profiling PDFBox when loading medium/large size
      documents and I think I found something.
      If you try loading the document
      http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
      it takes quite some time, and that time is mostly spent in
      XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
      an object contained in an unparsed object stream is found, the
      XrefTrailerResolver performs a full scan of the xref entries found in the
      document, in this case hundreds of thousands. If there are many object
      streams (as in the given doc), it performs many full scans, resulting in
      poor performance.
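
To illustrate the cost described in the quote, here is a minimal Java sketch that contrasts a full scan per object stream with an index built in a single pass. It is not the PDFBox code and not necessarily the fix applied for this issue: the class ObjectStreamIndexSketch and its methods are hypothetical, and the xref table is simplified to a map from an object number to the number of the object stream that contains it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in PDFBox.
public class ObjectStreamIndexSketch {

    // Slow variant: every lookup walks all xref entries,
    // so S object streams over N entries cost roughly S * N steps.
    public static Set<Long> containedByFullScan(Map<Long, Long> entries, long objStreamNumber) {
        Set<Long> result = new TreeSet<>();
        for (Map.Entry<Long, Long> entry : entries.entrySet()) {
            if (entry.getValue() == objStreamNumber) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Faster variant: group the entries by containing object stream in one pass.
    // Each later lookup is then a single hash map access.
    public static Map<Long, Set<Long>> buildIndex(Map<Long, Long> entries) {
        Map<Long, Set<Long>> index = new HashMap<>();
        for (Map.Entry<Long, Long> entry : entries.entrySet()) {
            index.computeIfAbsent(entry.getValue(), key -> new TreeSet<>())
                 .add(entry.getKey());
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed sample data: object number -> number of the containing object stream.
        Map<Long, Long> entries = new HashMap<>();
        entries.put(10L, 5L);
        entries.put(11L, 5L);
        entries.put(12L, 6L);

        Map<Long, Set<Long>> index = buildIndex(entries);

        // Both variants return the same objects for stream 5;
        // only the amount of work per lookup differs.
        System.out.println(containedByFullScan(entries, 5L));
        System.out.println(index.get(5L));
    }
}
```

The trade-off in this sketch is a small amount of extra memory for the index in exchange for avoiding repeated scans, which matters most when a document has both many xref entries and many object streams.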

            People

            • Assignee:
              lehmi Andreas Lehmkühler
              Reporter:
              lehmi Andreas Lehmkühler