PDFBox / PDFBOX-4896

Don't save and restore graphic states around showGlyph in LegacyPDFStreamEngine


Details

    Description

      One of the major performance bottlenecks in text extraction was the clone + push and the pop + clone operations on the graphics state before and after the call to showGlyph.

      Not only was the cloning slow, it also consumed large amounts of memory, making the garbage collector work harder.

      When extracting text, showGlyph does not modify the graphics state, so there is no need to save and restore the state around it.

      The same could be true in general, not just for text extraction, but I do not understand the code well enough to decide.

      I have only modified the behavior for the LegacyPDFStreamEngine, to be safe.

      The showGlyph operation sounds like a read-only operation that should not modify anything.
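
      To illustrate the reasoning, here is a minimal, self-contained sketch. It does not use the real PDFBox classes: State, Engine, protectState and run are hypothetical stand-ins, and the save step is simplified to a single clone + push. It only shows that when showGlyph is read-only with respect to the state, skipping the save / restore pair removes one clone per glyph without changing the extracted text.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-ins for illustration only; these are NOT the PDFBox classes.
class GlyphStateSketch {

    /** Toy graphics state holding a single matrix that must be deep-copied. */
    static final class State {
        double[] textMatrix = {1, 0, 0, 1, 0, 0};

        State copy() {
            State s = new State();
            s.textMatrix = textMatrix.clone();
            return s;
        }
    }

    static final class Engine {
        final Deque<State> stack = new ArrayDeque<>();
        final StringBuilder text = new StringBuilder();
        final boolean protectState; // true = save/restore around every glyph
        int cloneCount = 0;

        Engine(boolean protectState) {
            this.protectState = protectState;
            stack.push(new State());
        }

        // Simplified save: clone the current state and push the copy.
        void saveGraphicsState() {
            cloneCount++;
            stack.push(stack.peek().copy());
        }

        void restoreGraphicsState() {
            stack.pop();
        }

        // Read-only with respect to the graphics state: it only collects text.
        void showGlyph(char glyph) {
            text.append(glyph);
        }

        void showText(String s) {
            for (char c : s.toCharArray()) {
                if (protectState) saveGraphicsState();
                showGlyph(c);
                if (protectState) restoreGraphicsState();
            }
        }
    }

    /** Runs one extraction pass and returns the engine for inspection. */
    static Engine run(String s, boolean protectState) {
        Engine e = new Engine(protectState);
        e.showText(s);
        return e;
    }

    public static void main(String[] args) {
        Engine guarded = run("Hello", true);
        Engine unguarded = run("Hello", false);
        System.out.println(guarded.cloneCount + " clones, text=" + guarded.text);     // 5 clones, text=Hello
        System.out.println(unguarded.cloneCount + " clones, text=" + unguarded.text); // 0 clones, text=Hello
    }
}
```

      The sketch is only valid under the assumption stated in the description: if a showGlyph override did modify the state, the unguarded variant would leak that change to later glyphs.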

       

      I have the code ready and I will submit a patch and a review.

      Attachments

        1. pdfbox.png
          40 kB
          Alfred
        2. PDFBOX-4896.patch
          4 kB
          Alfred


          People

            lehmi Andreas Lehmkühler
            Faltiska Alfred
            Votes: 0
            Watchers: 4

