[PDFBOX-3429] Improve ExtractText Concurrency - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.0.1
Fix Version/s: None
Component/s: Text extraction
Labels: optimization
Environment: Win7, jdk1.8.0_60 x64

Description

While testing Tika 1.13, which uses PDFBox 2.0.1, from a multithreaded text extraction application, I noticed CPU usage of around 80% on my 6-core computer when processing a dataset of ~75 thousand PDFs (18GB). The text extraction took 5min25sec to complete. With Tika 1.10, which uses PDFBox 1.8.10, CPU usage stays around 100% and the same extraction took 4min37sec. The dataset is read from a ramdrive, so there is no I/O bottleneck. I suspect some new synchronization code is blocking the threads for a non-trivial amount of time, resulting in lower CPU usage than before.
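
One way to check a suspicion like this is to ask the JVM how long worker threads spend blocked on monitors. The following is a minimal sketch using only the standard `java.lang.management` API; it does not call PDFBox or Tika. The class `ContentionProbe`, the lock `SHARED_LOCK`, and the two sample workloads are hypothetical stand-ins, and in a real test the `Runnable` would wrap the actual extraction call. Note that `ThreadInfo.getBlockedTime()` only counts time spent waiting to enter `synchronized` blocks or methods; threads parked on `java.util.concurrent` locks would not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class ContentionProbe {

    // Hypothetical shared lock, standing in for whatever the extraction threads might contend on.
    private static final Object SHARED_LOCK = new Object();

    /**
     * Runs the task on the given number of threads and returns the total time,
     * in milliseconds, that those threads spent blocked waiting to enter a
     * synchronized block or method. Returns -1 if the JVM cannot measure it.
     */
    static long measureBlockedMillis(int threads, Runnable task) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadContentionMonitoringSupported()) {
            return -1;
        }
        mx.setThreadContentionMonitoringEnabled(true);

        AtomicLong totalBlocked = new AtomicLong();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            workers.add(new Thread(() -> {
                task.run();
                // Read the counter from inside the worker: ThreadInfo is null once a thread has ended.
                ThreadInfo info = mx.getThreadInfo(Thread.currentThread().getId());
                if (info != null && info.getBlockedTime() > 0) {
                    totalBlocked.addAndGet(info.getBlockedTime());
                }
            }));
        }
        workers.forEach(Thread::start);
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return totalBlocked.get();
    }

    /** Sample workload in which every step holds the shared lock, so threads serialize. */
    static Runnable contendedWork() {
        return () -> {
            for (int i = 0; i < 20; i++) {
                synchronized (SHARED_LOCK) {
                    sleepQuietly(5);
                }
            }
        };
    }

    /** Sample workload with the same steps but no shared lock. */
    static Runnable independentWork() {
        return () -> {
            for (int i = 0; i < 20; i++) {
                sleepQuietly(5);
            }
        };
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked ms (shared lock): " + measureBlockedMillis(4, contendedWork()));
        System.out.println("blocked ms (no lock):     " + measureBlockedMillis(4, independentWork()));
    }
}
```

If the blocked time reported for the newer library version is markedly higher than for the older one on the same input, that would support the synchronization hypothesis; a thread dump or profiler would still be needed to identify which lock is involved.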

Attachments

- 000000000000B265.pdf, 94 kB, 21/Jul/16 21:22, Luís Filipe Nassif
- cpu_pdfbox_2.0.3_and_1.8.10.png, 7 kB, 22/Jul/16 18:41, Luís Filipe Nassif
- cpu-pdfbox1.8.10.png, 2 kB, 21/Jul/16 21:20, Luís Filipe Nassif
- cpu-pdfbox-2.0.2.png, 2 kB, 21/Jul/16 21:20, Luís Filipe Nassif
- tilman-combined-cpu.png, 1 kB, 22/Jul/16 17:53, Tilman Hausherr

People

Assignee: Unassigned

Reporter: Luís Filipe Nassif

Votes: 0

Watchers: 3

Dates

Created: 19/Jul/16 02:02

Updated: 26/Jul/16 16:55