PDFBox / PDFBOX-3503

2.0 much slower than 1.8 for text extraction with certain PDF files

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Windows 2012 R2, Oracle Java 1.8.0_91 64-bit

      Description

      While testing Tika 1.13 we've hit a performance problem with PDF conversion to text for certain files from the GovDocs corpus.

      I have attached an example PDF that shows the problem. It can be reproduced using pdfbox-app.jar. Running the extraction with 1.8.12 takes around 1 second:

      java -jar pdfbox-app-1.8.12.jar ExtractText 074031.pdf 074031.pdf.txt

      Doing the same with 2.0.2 takes around 30 seconds:

      java -jar pdfbox-app-2.0.2.jar ExtractText 074031.pdf 074031.pdf.txt

      This is a small PDF (109 kB), so taking 30 seconds seems excessive.
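
      To make the comparison repeatable, the two command lines above could be timed with a small harness. The following is only a sketch, not part of the original report: the class name ExtractTextTiming and the method timeCommand are hypothetical, and it assumes both pdfbox-app jars and 074031.pdf are in the working directory and that `java` is on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of PDFBox): times external commands.
public class ExtractTextTiming {

    /** Runs a command, waits for it to exit, and returns the elapsed wall-clock time in ms. */
    public static long timeCommand(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.inheritIO(); // let the child process write to this console
        long start = System.nanoTime();
        int exitCode = pb.start().waitFor();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode + ": " + command);
        }
        return elapsedMs;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the PDF and both jars sit in the current directory.
        String pdf = args.length > 0 ? args[0] : "074031.pdf";
        for (String jar : Arrays.asList("pdfbox-app-1.8.12.jar", "pdfbox-app-2.0.2.jar")) {
            long ms = timeCommand(Arrays.asList(
                    "java", "-jar", jar, "ExtractText", pdf, pdf + ".txt"));
            System.out.println(jar + ": " + ms + " ms");
        }
    }
}
```

      The measured time includes JVM startup for each run, so it slightly overstates the extraction time itself. That overhead is roughly the same for both versions and should not hide a difference of the size reported here.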

        Attachments

        1. 074031.pdf
          109 kB
          Andy McMullan

              People

              • Assignee: Unassigned
              • Reporter: Andy McMullan (andy@andymcm.com)
              • Votes: 0
              • Watchers: 4
