PDFBox / PDFBOX-5799

Page with thousands of content streams takes extremely long to render or extract


Details

    Description

      As reported by Erik Branks on the mailing list:

      When attempting text extraction from the PDF at https://d-nb.info/1324982411/34, using either PDFBox 3.0.0 or PDFBox 4.0.0-SNAPSHOT, the extraction uses about 1.8 GB of heap memory and does not seem to terminate. I cancelled the extraction attempt after roughly 20 minutes. Is this another bad PDF or is there a bug in PDFBox?

      This happens with pages 230 and 231 (maybe others). Both have thousands of content streams in the content stream array. The profiler suggests that most time is spent in SequenceRandomAccessRead.seek().

      Rendering page 230 with PDFBox 2.0: 50 seconds

      Rendering page 230 with PDFBox trunk: 2990 seconds

      Rendering page 231 with PDFBox trunk: 4798 seconds
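
To illustrate why a seek() hotspot can matter when a page has thousands of content streams, here is a minimal sketch. It is not PDFBox code and does not claim to show how SequenceRandomAccessRead is implemented or what the actual cause is. The class SeekSketch and its methods are hypothetical, and all stream lengths are assumed to be positive. It only shows the general effect: if each seek over a concatenation of n streams walks the streams linearly, the cost per seek grows with n, whereas a binary search over precomputed start offsets keeps it logarithmic.

```java
import java.util.Arrays;

/** Hypothetical sketch (not PDFBox code): find which stream of a concatenation holds a given offset. */
final class SeekSketch {

    /** Returns starts[i] = offset of the first byte of stream i in the concatenated view. */
    static long[] startOffsets(long[] lengths) {
        long[] starts = new long[lengths.length];
        long pos = 0;
        for (int i = 0; i < lengths.length; i++) {
            starts[i] = pos;
            pos += lengths[i];
        }
        return starts;
    }

    /** Linear scan over all stream lengths: O(n) work per seek. */
    static int linearSeek(long[] lengths, long position) {
        long end = 0;
        for (int i = 0; i < lengths.length; i++) {
            end += lengths[i];
            if (position < end) {
                return i;
            }
        }
        return lengths.length - 1; // past the end: clamp to the last stream
    }

    /** Binary search over precomputed start offsets: O(log n) work per seek. */
    static int binarySeek(long[] starts, long position) {
        int idx = Arrays.binarySearch(starts, position);
        // Exact hit: position is the first byte of stream idx.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // containing stream is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

With lengths {10, 5, 20}, both methods map offset 12 to stream 1. With thousands of streams and many seeks per page, the linear variant repeats the full walk each time, which is the kind of growth the profiler output above would be consistent with.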

          People

            lehmi Andreas Lehmkühler
            tilman Tilman Hausherr
