  1. PDFBox
  2. PDFBOX-3224

Cache Font Bounding Boxes for Performance in Text Extraction

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: PDModel
    • Labels: Patch

    Description

      Hi,

      I have been using PDFBox by way of Tika for a while for text extraction from PDFs. I had a chance to fire up a profiler recently and found that getBoundingBox() in the PDXXFont.java classes is called fairly frequently, in particular from PDFTextStreamEngine.showGlyph(). I've attached a patch that caches the BoundingBox object alongside the PDFont object inside PDTextState. There are other ways to accomplish the same thing, such as caching inside the various font objects themselves.
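To illustrate the idea, here is a minimal sketch of caching a bounding box next to the current font so that repeated lookups during glyph processing do not recompute it. This is not the attached patch: `SketchFont` and `SketchTextState` are hypothetical stand-ins for PDFont and PDTextState, and the invalidate-on-font-change policy is an assumption.

```java
// Hypothetical stand-in for a font whose bounding box is costly to compute.
interface SketchFont {
    float[] computeBoundingBox(); // {lowerLeftX, lowerLeftY, upperRightX, upperRightY}
}

// Hypothetical stand-in for a text state that holds the current font
// and lazily caches that font's bounding box.
class SketchTextState {
    private SketchFont font;
    private float[] cachedBox; // null until first requested for the current font

    void setFont(SketchFont newFont) {
        font = newFont;
        cachedBox = null; // assumption: any font change invalidates the cache
    }

    float[] getFontBoundingBox() {
        // Compute once per font assignment, then reuse for every glyph.
        if (cachedBox == null && font != null) {
            cachedBox = font.computeBoundingBox();
        }
        return cachedBox;
    }
}
```

With this shape, a per-glyph caller asks the text state for the box instead of asking the font directly, so the expensive computation runs once per font assignment rather than once per glyph.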

      I wrote a little test program to measure the speed difference against a few randomly selected files. The program just uses PDFTextStripper to retrieve raw text from a PDF.
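The test program itself is not attached, so the following is only a sketch of how such a measurement could be structured. It assumes a fixed time window per file and a rate computed as completed runs divided by the nominal window length, which is consistent with the figures reported here but not confirmed. `ThroughputSketch` and its methods are hypothetical names, and the `Runnable` stands in for one full text extraction with PDFTextStripper.

```java
import java.util.concurrent.TimeUnit;

class ThroughputSketch {
    // Runs the task repeatedly until the time budget is used up and
    // returns how many runs completed. The last run may overshoot the budget.
    static long countRuns(Runnable task, long budgetNanos) {
        long start = System.nanoTime();
        long runs = 0;
        while (System.nanoTime() - start < budgetNanos) {
            task.run();
            runs++;
        }
        return runs;
    }

    // Rate relative to the nominal window, not the exact elapsed time (assumption).
    static double ratePerSecond(long runs, long nominalSeconds) {
        return (double) runs / nominalSeconds;
    }

    // Measures one task over a window of the given length in seconds.
    static double measure(Runnable task, long seconds) {
        long runs = countRuns(task, TimeUnit.SECONDS.toNanos(seconds));
        return ratePerSecond(runs, seconds);
    }
}
```

A caller would pass a task that loads one PDF and extracts its text, for example `ThroughputSketch.measure(extractOnce, 60)`, where `extractOnce` is a hypothetical `Runnable` wrapping the extraction.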

      Here's what I found:

      ====plain====
      File: BambooCheatSheet.pdf Duration: 60037555619 rate: 81.6 files/sec
      File: flu.pdf Duration: 60019978409 rate: 34.46666666666667 files/sec
      File: megacli_user_guide.pdf Duration: 60641314800 rate: 1.1833333333333333 files/sec
      File: odbc-perl.pdf Duration: 60008216404 rate: 19.466666666666665 files/sec
      File: VerticaArchitectureWhitePaper.pdf Duration: 60084726865 rate: 7.433333333333334 files/sec
      File: WritingaResume.pdf Duration: 60015267784 rate: 59.4 files/sec

      ===boundingbox caching===
      File: BambooCheatSheet.pdf Duration: 60005724588 rate: 106.1 files/sec
      File: flu.pdf Duration: 60021410660 rate: 41.916666666666664 files/sec
      File: megacli_user_guide.pdf Duration: 60107488363 rate: 1.7833333333333334 files/sec
      File: odbc-perl.pdf Duration: 60017784515 rate: 29.9 files/sec
      File: VerticaArchitectureWhitePaper.pdf Duration: 60012261509 rate: 9.05 files/sec
      File: WritingaResume.pdf Duration: 60007995996 rate: 76.5 files/sec
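For comparison, the per-file speedup can be derived by dividing each cached rate by the corresponding plain rate. The sketch below does that arithmetic with the rates transcribed from the two listings above; `SpeedupSketch` is a hypothetical helper, not part of the test program. The resulting ratios fall between roughly 1.2x and 1.5x.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SpeedupSketch {
    // Ratio of the cached rate to the plain rate; values above 1 mean faster.
    static double speedup(double plainRate, double cachedRate) {
        return cachedRate / plainRate;
    }

    // Rates in files/sec, copied from the listings: {plain, boundingbox caching}.
    static Map<String, Double> speedups() {
        Map<String, double[]> rates = new LinkedHashMap<>();
        rates.put("BambooCheatSheet.pdf", new double[] {81.6, 106.1});
        rates.put("flu.pdf", new double[] {34.46666666666667, 41.916666666666664});
        rates.put("megacli_user_guide.pdf", new double[] {1.1833333333333333, 1.7833333333333334});
        rates.put("odbc-perl.pdf", new double[] {19.466666666666665, 29.9});
        rates.put("VerticaArchitectureWhitePaper.pdf", new double[] {7.433333333333334, 9.05});
        rates.put("WritingaResume.pdf", new double[] {59.4, 76.5});

        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : rates.entrySet()) {
            result.put(e.getKey(), speedup(e.getValue()[0], e.getValue()[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Print one line per file with the ratio rounded to two decimals.
        speedups().forEach((file, ratio) ->
                System.out.printf("%s: %.2fx%n", file, ratio));
    }
}
```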

      Cheers

      Attachments

        1. pdfont-bounding-box-caching.patch
          7 kB
          Tom Callahan
        2. bounding-box-caching.patch
          3 kB
          Tom Callahan

          People

            Assignee: Tilman Hausherr (tilman)
            Reporter: Tom Callahan (tom.callahan)