  1. PDFBox
  2. PDFBOX-3429

Improve ExtractText Concurrency

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.1
    • Fix Version/s: None
    • Component/s: Text extraction
    • Environment: Win7, jdk1.8.0_60 x64

    Description

      While testing Tika 1.13, which uses PDFBox 2.0.1, in a multithreaded text extraction application, I observed CPU usage of around 80% on my 6-core computer when processing a dataset of ~75 thousand PDFs (18 GB). The text extraction took 5 min 25 s to complete. With Tika 1.10, which uses PDFBox 1.8.10, CPU usage stays around 100% and the extraction took 4 min 37 s. The dataset is read from a RAM drive, so there is no I/O bottleneck. I suspect that some new synchronization code blocks the threads for a non-trivial amount of time, resulting in lower CPU usage than before.
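
One way to check the synchronization hypothesis is to ask the JVM how often worker threads had to wait to enter a monitor. The sketch below is only an illustration and does not use PDFBox or Tika: `ContentionProbe`, `SHARED_LOCK` and `work()` are hypothetical names, and a 1 ms sleep stands in for extracting text from one PDF. It relies on the standard `java.lang.management.ThreadMXBean` and `ThreadInfo.getBlockedCount()` calls.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class ContentionProbe {

    // Hypothetical stand-in for a lock shared by all extraction threads.
    private static final Object SHARED_LOCK = new Object();

    // Simulated unit of work; a real test would extract text from one PDF here.
    private static void work() throws InterruptedException {
        Thread.sleep(1);
    }

    /**
     * Runs the workers and returns how many times, in total, they had to
     * wait to enter a monitor, as counted by the JVM.
     */
    static long totalBlockedCount(int threads, int iterations, boolean shareLock)
            throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        CountDownLatch start = new CountDownLatch(1);
        AtomicLong blocked = new AtomicLong();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                try {
                    start.await(); // release all workers at the same moment
                    for (int i = 0; i < iterations; i++) {
                        if (shareLock) {
                            synchronized (SHARED_LOCK) {
                                work();
                            }
                        } else {
                            work();
                        }
                    }
                    // Read the counter while the thread is still alive;
                    // getThreadInfo returns null for terminated threads.
                    ThreadInfo info = mx.getThreadInfo(Thread.currentThread().getId());
                    if (info != null) {
                        blocked.addAndGet(info.getBlockedCount());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[t].start();
        }
        start.countDown();
        for (Thread th : pool) {
            th.join();
        }
        return blocked.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("blocked (shared lock): " + totalBlockedCount(4, 25, true));
        System.out.println("blocked (no lock):     " + totalBlockedCount(4, 25, false));
    }
}
```

In a real run, the same counter could be read from the extraction threads after processing the dataset with each PDFBox version; a noticeably higher blocked count with 2.0.1 would be consistent with the suspicion above, though it would not by itself identify the lock involved.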

      Attachments

        1. 000000000000B265.pdf
          94 kB
          Luís Filipe Nassif
        2. cpu_pdfbox_2.0.3_and_1.8.10.png
          7 kB
          Luís Filipe Nassif
        3. cpu-pdfbox1.8.10.png
          2 kB
          Luís Filipe Nassif
        4. cpu-pdfbox-2.0.2.png
          2 kB
          Luís Filipe Nassif
        5. tilman-combined-cpu.png
          1 kB
          Tilman Hausherr

          People

            Assignee: Unassigned
            Reporter: Luís Filipe Nassif (lfcnassif)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated: