Details
Improvement
Status: Closed
Major
Resolution: Fixed
2.0.0
Patch
Description
Hi,
I have been using PDFBox by way of Tika for a while for text extraction from PDFs. I recently had a chance to fire up a profiler and found that getBoundingBox() in the PDXXFont.java classes is called fairly frequently, in particular from PDFTextStreamEngine.showGlyph(). I've attached a patch that caches the BoundingBox object alongside the PDFont object inside PDTextState. There are a variety of other ways to accomplish the same thing, such as caching inside the various font objects themselves.
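To illustrate the general idea (this is not the attached patch), here is a minimal sketch of caching a bounding box next to the font it belongs to. SketchFont and SketchTextState are hypothetical stand-ins, not the PDFBox PDFont or PDTextState classes, and the box values are made up for illustration.

```java
// Hypothetical stand-in for a font whose bounding box is costly to build.
// It is NOT the PDFBox PDFont API; it only counts how often the box is computed.
final class SketchFont {
    int computations = 0;

    float[] getBoundingBox() {
        computations++; // pretend this parses font data each time
        return new float[] {0f, -200f, 1000f, 800f}; // made-up box values
    }
}

// Hypothetical text state that keeps the box alongside the current font.
final class SketchTextState {
    private SketchFont font;
    private float[] cachedBox; // null until first requested

    void setFont(SketchFont newFont) {
        if (newFont != font) {
            font = newFont;
            cachedBox = null; // invalidate the cache when the font changes
        }
    }

    float[] getFontBoundingBox() {
        if (cachedBox == null && font != null) {
            cachedBox = font.getBoundingBox(); // computed once per font selection
        }
        return cachedBox;
    }
}

final class BoundingBoxCacheSketch {
    // Simulates a glyph loop that asks for the bounding box once per glyph
    // and reports how many times the font actually had to compute it.
    static int computationsFor(int glyphs) {
        SketchFont font = new SketchFont();
        SketchTextState state = new SketchTextState();
        state.setFont(font);
        for (int i = 0; i < glyphs; i++) {
            state.getFontBoundingBox();
        }
        return font.computations;
    }

    public static void main(String[] args) {
        System.out.println("bounding box computations: " + computationsFor(10_000));
    }
}
```

The cache is dropped whenever a different font is set, so a stale box is never returned for a new font. Caching inside the font object itself would avoid that bookkeeping, at the cost of touching each font class.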
I wrote a little test program to measure the speed difference against a few randomly selected files. The program just uses PDFTextStripper to retrieve raw text from a PDF.
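The test program itself is not attached here, so the following is only a sketch of how such a throughput measurement could be structured. It assumes the Duration values are System.nanoTime() readings over a window of roughly 60 seconds per file. ThroughputSketch and its placeholder workload are hypothetical; in the real test the task would be a PDFTextStripper extraction of one file.

```java
final class ThroughputSketch {
    /** Outcome of one timed run: completed iterations and elapsed nanoseconds. */
    static final class Result {
        final long iterations;
        final long elapsedNanos;

        Result(long iterations, long elapsedNanos) {
            this.iterations = iterations;
            this.elapsedNanos = elapsedNanos;
        }

        double ratePerSecond() {
            return iterations / (elapsedNanos / 1e9);
        }
    }

    // Runs the task repeatedly until at least budgetNanos have elapsed.
    // The last iteration always finishes, so elapsed time slightly exceeds the budget.
    static Result measure(Runnable task, long budgetNanos) {
        long start = System.nanoTime();
        long iterations = 0;
        long elapsed;
        do {
            task.run();
            iterations++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < budgetNanos);
        return new Result(iterations, elapsed);
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for "extract the text of one PDF".
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        Result r = measure(task, 200_000_000L); // 0.2 s budget for the demo
        System.out.println("Duration: " + r.elapsedNanos
                + " rate: " + r.ratePerSecond() + " runs/sec");
    }
}
```

Running each file for a fixed time window rather than a fixed iteration count keeps the total benchmark time predictable even when one file is far slower than the others.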
Here's what I found:
====plain====
File: BambooCheatSheet.pdf Duration: 60037555619 rate: 81.6 files/sec
File: flu.pdf Duration: 60019978409 rate: 34.46666666666667 files/sec
File: megacli_user_guide.pdf Duration: 60641314800 rate: 1.1833333333333333 files/sec
File: odbc-perl.pdf Duration: 60008216404 rate: 19.466666666666665 files/sec
File: VerticaArchitectureWhitePaper.pdf Duration: 60084726865 rate: 7.433333333333334 files/sec
File: WritingaResume.pdf Duration: 60015267784 rate: 59.4 files/sec
===boundingbox caching===
File: BambooCheatSheet.pdf Duration: 60005724588 rate: 106.1 files/sec
File: flu.pdf Duration: 60021410660 rate: 41.916666666666664 files/sec
File: megacli_user_guide.pdf Duration: 60107488363 rate: 1.7833333333333334 files/sec
File: odbc-perl.pdf Duration: 60017784515 rate: 29.9 files/sec
File: VerticaArchitectureWhitePaper.pdf Duration: 60012261509 rate: 9.05 files/sec
File: WritingaResume.pdf Duration: 60007995996 rate: 76.5 files/sec
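For easier comparison, the ratio of the cached rate to the plain rate can be computed per file. The sketch below does only that arithmetic on the figures reported above; the class name SpeedupSketch is made up for illustration. The ratios work out to roughly 1.2x to 1.5x depending on the file.

```java
final class SpeedupSketch {
    static final String[] FILES = {
        "BambooCheatSheet.pdf", "flu.pdf", "megacli_user_guide.pdf",
        "odbc-perl.pdf", "VerticaArchitectureWhitePaper.pdf", "WritingaResume.pdf"
    };

    // Rates in files/sec, copied from the two result blocks above.
    static final double[] PLAIN = {
        81.6, 34.46666666666667, 1.1833333333333333,
        19.466666666666665, 7.433333333333334, 59.4
    };
    static final double[] CACHED = {
        106.1, 41.916666666666664, 1.7833333333333334,
        29.9, 9.05, 76.5
    };

    // Ratio of cached throughput to plain throughput for file i.
    static double speedup(int i) {
        return CACHED[i] / PLAIN[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < FILES.length; i++) {
            System.out.println(String.format("%-36s %.2fx", FILES[i], speedup(i)));
        }
    }
}
```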
Cheers