Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.1
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
    • Environment:
      Win7, jdk1.8.0_60 x64

      Description

      While testing Tika 1.13, which uses PDFBox 2.0.1, from a multithreaded text extraction application, I noticed CPU usage of around 80% on my 6-core computer when processing a dataset of ~75 thousand PDFs (18 GB). The text extraction took 5 min 25 s to complete. With Tika 1.10, which uses PDFBox 1.8.10, CPU usage stays around 100% and the extraction took 4 min 37 s. The dataset is read from a RAM drive, so there is no I/O bottleneck. I suspect some new synchronization code blocks the threads for a non-trivial amount of time, resulting in lower CPU usage than before.
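
      One way to check the suspicion above is to ask the JVM how often worker threads blocked while trying to enter a monitor. The following is only a minimal sketch using the standard `java.lang.management` API. The class name `ContentionProbe`, the `SHARED_LOCK` object and the sleep-based workload are hypothetical stand-ins for the extraction threads and whatever lock the library may hold; they are not PDFBox or Tika code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class ContentionProbe {

    // Hypothetical stand-in for a lock shared by all extraction threads.
    private static final Object SHARED_LOCK = new Object();

    /**
     * Starts the given number of workers, lets each hold SHARED_LOCK for
     * holdMillis, and returns the total number of times the workers
     * blocked while waiting to enter a monitor.
     */
    public static long measureBlockedCount(int threads, long holdMillis)
            throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        AtomicLong totalBlocked = new AtomicLong();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all workers at the same time
                    synchronized (SHARED_LOCK) {
                        Thread.sleep(holdMillis); // simulated work under the lock
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                // Read this thread's own counter while it is still alive;
                // getThreadInfo returns null for terminated threads.
                long id = Thread.currentThread().getId();
                totalBlocked.addAndGet(mx.getThreadInfo(id).getBlockedCount());
            });
            workers[i].start();
        }

        start.countDown();
        for (Thread t : workers) {
            t.join();
        }
        return totalBlocked.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("monitor blocks: " + measureBlockedCount(6, 20));
    }
}
```

      In a real run the same counter could be read from the extraction threads themselves under both library versions; a much higher blocked count with the newer version would support the synchronization hypothesis, while similar counts would point elsewhere.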

        Attachments

        1. cpu-pdfbox1.8.10.png (2 kB, Luis Filipe Nassif)
        2. cpu-pdfbox-2.0.2.png (2 kB, Luis Filipe Nassif)
        3. 000000000000B265.pdf (94 kB, Luis Filipe Nassif)
        4. tilman-combined-cpu.png (1 kB, Tilman Hausherr)
        5. cpu_pdfbox_2.0.3_and_1.8.10.png (7 kB, Luis Filipe Nassif)

            People

            • Assignee: Unassigned
            • Reporter: Luis Filipe Nassif (lfcnassif)
            • Votes: 0
            • Watchers: 3