PDFBox / PDFBOX-604

Various text extraction performance improvements

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: Text extraction
    • Labels: None

      Description

      Even after Mel's recent patches I've found a number of small performance bottlenecks that we could get rid of.

        Activity

        Jukka Zitting added a comment -

        See revisions 899802, 899804, 899806, 899807 and 899810 for the improvements I made. This covers pretty much all of the remaining immediate simple bottlenecks I could find through profiling, so I'm resolving this issue as fixed.

        The biggest higher-level performance bottleneck is the way o.a.p.util.PDFStreamEngine.processEncodedText() processes each glyph separately. We would likely see major performance improvements if we refactored things so that the entire string of encoded glyphs is first decoded in a single operation and any graphics transformations are then applied to that whole block before processing the characters. That, however, is best handled as a separate issue.
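
        To make the proposed refactoring concrete, here is a rough sketch of the two styles side by side. It is not PDFBox code: the class and method names are hypothetical, and it assumes a single-byte ISO-8859-1 encoding and a fixed glyph advance, whereas real PDF fonts have their own encodings and per-glyph widths. It only illustrates decoding the whole string once and transforming all glyph origins in one call instead of once per glyph.

```java
import java.awt.geom.AffineTransform;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names are PDFBox API.
final class BatchDecodeSketch {

    // Per-glyph style: one decode call and one transform call for every byte.
    static double[] perGlyph(byte[] encoded, double glyphWidth,
                             AffineTransform textToDevice, StringBuilder text) {
        double[] positions = new double[encoded.length * 2];
        double[] point = new double[2];
        for (int i = 0; i < encoded.length; i++) {
            text.append(new String(encoded, i, 1, StandardCharsets.ISO_8859_1));
            point[0] = i * glyphWidth; // assumed fixed advance per glyph
            point[1] = 0.0;
            textToDevice.transform(point, 0, positions, i * 2, 1);
        }
        return positions;
    }

    // Batched style: decode the whole string once, then transform all
    // glyph origins with a single call on the coordinate array.
    static double[] batched(byte[] encoded, double glyphWidth,
                            AffineTransform textToDevice, StringBuilder text) {
        text.append(new String(encoded, StandardCharsets.ISO_8859_1));
        double[] positions = new double[encoded.length * 2];
        for (int i = 0; i < encoded.length; i++) {
            positions[i * 2] = i * glyphWidth; // y stays 0.0
        }
        textToDevice.transform(positions, 0, positions, 0, encoded.length);
        return positions;
    }
}
```

        Both methods should yield the same text and the same positions; the batched form just replaces N small decoder and transform calls with one of each. Whether that translates into a measurable gain in PDFBox would need to be confirmed by profiling.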

        Andreas Lehmkühler added a comment -

        Closed after releasing version 1.0.0.


          People

          • Assignee: Jukka Zitting
          • Reporter: Jukka Zitting
          • Votes: 0
          • Watchers: 0
