Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.0.1
Fix Version/s: None
Environment: Win7, jdk1.8.0_60 x64
Description
While testing Tika 1.13, which uses PDFBox 2.0.1, from a multithreaded text extraction application, I noticed CPU usage of around 80% on my 6-core computer when processing a dataset of ~75 thousand PDFs (18GB). It took 5min25sec to complete the text extraction. With Tika 1.10, which uses PDFBox 1.8.10, CPU usage stays around 100% and the same extraction took 4min37sec. The dataset is read from a ramdrive, so there is no I/O bottleneck. I suspect there is some new synchronization code that blocks the threads for a non-trivial amount of time, resulting in lower CPU usage than before.
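
One way to check the lock-contention suspicion is to ask the JVM how long worker threads spend blocked on monitors, using the standard `java.lang.management.ThreadMXBean` API. The sketch below is only an illustration: `ContentionProbe`, `SHARED_LOCK` and `measure` are hypothetical names, and the `synchronized` block with a sleep is a stand-in for whatever shared lock the library might hold. It does not call Tika or PDFBox.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;

public class ContentionProbe {

    // Hypothetical stand-in for a lock shared by all extraction threads.
    private static final Object SHARED_LOCK = new Object();

    /**
     * Runs `threads` workers that each enter SHARED_LOCK `iterations` times,
     * holding it for `holdMillis`. Returns {total blocked count, total blocked ms}.
     */
    static long[] measure(int threads, int iterations, long holdMillis) {
        final ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Blocked-time accounting is optional in the JVM and off by default.
        final boolean timed = mx.isThreadContentionMonitoringSupported();
        if (timed) {
            mx.setThreadContentionMonitoringEnabled(true);
        }

        final AtomicLong blockedCount = new AtomicLong();
        final AtomicLong blockedMillis = new AtomicLong();

        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (int n = 0; n < iterations; n++) {
                        synchronized (SHARED_LOCK) {
                            Thread.sleep(holdMillis); // simulated work done under the lock
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                // Sample from inside the thread: getThreadInfo returns null once it has died.
                ThreadInfo info = mx.getThreadInfo(Thread.currentThread().getId());
                if (info != null) {
                    blockedCount.addAndGet(info.getBlockedCount());
                    if (timed) {
                        blockedMillis.addAndGet(info.getBlockedTime());
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return new long[] { blockedCount.get(), blockedMillis.get() };
    }

    public static void main(String[] args) {
        long[] r = measure(4, 5, 5);
        System.out.println("blocked count=" + r[0] + ", blocked ms=" + r[1]);
    }
}
```

In a real run, the same sampling would go at the end of each extraction worker instead of the simulated loop. Total blocked time that is a large share of wall-clock time under PDFBox 2.0.1 but not under 1.8.10 would support the synchronization hypothesis. Thread dumps taken during the run would then show which monitor the threads are waiting on.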