PDFBox / PDFBOX-3224

Cache Font Bounding Boxes for Performance in Text Extraction

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: PDModel
    • Flags: Patch

      Description

      Hi,

      I have been using PDFBox by way of Tika for text extraction from PDFs for a while. I recently had a chance to fire up a profiler and found that getBoundingBox() in the PDXXFont.java classes is called fairly frequently, in particular from PDFTextStreamEngine.showGlyph(). I've attached a patch that caches the BoundingBox object alongside the PDFont object inside PDTextState. There are other ways to accomplish the same thing, such as caching inside the various font objects themselves.
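
To illustrate the idea, here is a minimal sketch of caching a bounding box next to the font in a text-state object. It does not reproduce the attached patch: `SketchFont`, `SketchBoundingBox`, and `SketchTextState` are hypothetical stand-ins, not the PDFBox classes, and it assumes that computing the box is expensive compared with reading a field.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a font bounding box; NOT the PDFBox class.
final class SketchBoundingBox {
    final float lowerLeftX, lowerLeftY, upperRightX, upperRightY;

    SketchBoundingBox(float llx, float lly, float urx, float ury) {
        this.lowerLeftX = llx;
        this.lowerLeftY = lly;
        this.upperRightX = urx;
        this.upperRightY = ury;
    }

    float getHeight() {
        return upperRightY - lowerLeftY;
    }
}

// Hypothetical font whose bounding box is assumed to be costly to compute.
interface SketchFont {
    SketchBoundingBox computeBoundingBox();
}

// Hypothetical text state that keeps the box cached alongside the font.
final class SketchTextState {
    private SketchFont font;
    private SketchBoundingBox cachedBox;

    void setFont(SketchFont newFont) {
        font = newFont;
        cachedBox = null; // a new font invalidates the cached box
    }

    SketchBoundingBox getFontBoundingBox() {
        if (cachedBox == null && font != null) {
            cachedBox = font.computeBoundingBox(); // computed once per font
        }
        return cachedBox;
    }
}

class BoundingBoxCacheSketch {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        SketchFont font = () -> {
            calls.incrementAndGet();
            return new SketchBoundingBox(0, -200, 1000, 800);
        };

        SketchTextState state = new SketchTextState();
        state.setFont(font);
        // Simulate many glyphs being shown with the same font.
        for (int glyph = 0; glyph < 10_000; glyph++) {
            state.getFontBoundingBox();
        }
        System.out.println("bounding box computed " + calls.get() + " time(s)");
    }
}
```

The cache is dropped whenever the font changes, so a per-glyph caller only pays for the computation once per font switch.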

      I wrote a little test program to measure the speed difference against a few randomly selected files. The program just uses PDFTextStripper to retrieve raw text from a PDF.
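
The test program itself is not attached, so the following is only a guess at its general shape: run one task repeatedly for a fixed time budget and report iterations per second. The workload here is a placeholder `Runnable`; in the real test it would presumably wrap a PDFTextStripper extraction of one file. All class and method names are hypothetical.

```java
// Hypothetical time-boxed throughput harness; not the reporter's program.
final class ThroughputSketch {

    static final class Result {
        final long iterations;
        final long elapsedNanos;

        Result(long iterations, long elapsedNanos) {
            this.iterations = iterations;
            this.elapsedNanos = elapsedNanos;
        }

        // Iterations per second, based on the measured elapsed time.
        double ratePerSecond() {
            return iterations / (elapsedNanos / 1_000_000_000.0);
        }
    }

    // Runs the task at least once and keeps going until the budget is used up.
    static Result measure(Runnable task, long budgetNanos) {
        long start = System.nanoTime();
        long iterations = 0;
        long elapsed;
        do {
            task.run();
            iterations++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < budgetNanos);
        return new Result(iterations, elapsed);
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for "extract text from one PDF".
        Runnable placeholder = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("unexpected empty result");
            }
        };

        Result r = measure(placeholder, 1_000_000_000L); // one-second budget
        System.out.println("Duration: " + r.elapsedNanos
                + " rate: " + r.ratePerSecond() + " iterations/sec");
    }
}
```

A real workload that throws checked exceptions such as IOException would need to wrap them inside the `Runnable`.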

      Here's what I found:

      ====plain====
      File: BambooCheatSheet.pdf Duration: 60037555619 rate: 81.6 files/sec
      File: flu.pdf Duration: 60019978409 rate: 34.46666666666667 files/sec
      File: megacli_user_guide.pdf Duration: 60641314800 rate: 1.1833333333333333 files/sec
      File: odbc-perl.pdf Duration: 60008216404 rate: 19.466666666666665 files/sec
      File: VerticaArchitectureWhitePaper.pdf Duration: 60084726865 rate: 7.433333333333334 files/sec
      File: WritingaResume.pdf Duration: 60015267784 rate: 59.4 files/sec

      ===boundingbox caching===
      File: BambooCheatSheet.pdf Duration: 60005724588 rate: 106.1 files/sec
      File: flu.pdf Duration: 60021410660 rate: 41.916666666666664 files/sec
      File: megacli_user_guide.pdf Duration: 60107488363 rate: 1.7833333333333334 files/sec
      File: odbc-perl.pdf Duration: 60017784515 rate: 29.9 files/sec
      File: VerticaArchitectureWhitePaper.pdf Duration: 60012261509 rate: 9.05 files/sec
      File: WritingaResume.pdf Duration: 60007995996 rate: 76.5 files/sec
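
The relative gain per file can be read off the rates above by dividing the cached rate by the plain rate. The sketch below does that arithmetic on the reported numbers; the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that compares the two sets of reported rates.
final class SpeedupSketch {

    // Ratio of the cached rate to the plain rate; 1.0 means no change.
    static double speedup(double plainRate, double cachedRate) {
        return cachedRate / plainRate;
    }

    // Rates in files/sec copied from the report: {plain, with caching}.
    static Map<String, double[]> reportedRates() {
        Map<String, double[]> rates = new LinkedHashMap<>();
        rates.put("BambooCheatSheet.pdf", new double[] {81.6, 106.1});
        rates.put("flu.pdf", new double[] {34.46666666666667, 41.916666666666664});
        rates.put("megacli_user_guide.pdf", new double[] {1.1833333333333333, 1.7833333333333334});
        rates.put("odbc-perl.pdf", new double[] {19.466666666666665, 29.9});
        rates.put("VerticaArchitectureWhitePaper.pdf", new double[] {7.433333333333334, 9.05});
        rates.put("WritingaResume.pdf", new double[] {59.4, 76.5});
        return rates;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, double[]> e : reportedRates().entrySet()) {
            double s = speedup(e.getValue()[0], e.getValue()[1]);
            System.out.printf("%-36s %.2fx (%+.1f%%)%n", e.getKey(), s, (s - 1.0) * 100.0);
        }
    }
}
```

On these figures the improvement works out to roughly 22% to 54%, depending on the file.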

      Cheers

        Attachments

        1. bounding-box-caching.patch (3 kB, Tom Callahan)
        2. pdfont-bounding-box-caching.patch (7 kB, Tom Callahan)

            People

            • Assignee: Tilman Hausherr (tilman)
            • Reporter: Tom Callahan (tom.callahan)