PDFBox / PDFBOX-3503

2.0 much slower than 1.8 for text extraction with certain PDF files

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Windows 2012 R2, Oracle Java 1.8.0_91 64-bit

    Description

      While testing Tika 1.13 we've hit a performance problem with PDF conversion to text for certain files from the GovDocs corpus.

      I have attached an example PDF that shows the problem. The problem can be reproduced using pdfbox-app.jar. Running the extraction with 1.8.12 takes around 1 second:

      java -jar pdfbox-app-1.8.12.jar ExtractText 074031.pdf 074031.pdf.txt

      Doing the same with 2.0.2 takes around 30 seconds:

      java -jar pdfbox-app-2.0.2.jar ExtractText 074031.pdf 074031.pdf.txt

      This is a small PDF (109 kB), so taking 30 seconds seems excessive.
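
      To compare the two versions side by side, the two commands above could be timed with a small harness. This is only a sketch, not part of PDFBox: the class name ExtractTimer and its helper methods are hypothetical, and it assumes both pdfbox-app jars and 074031.pdf are in the working directory. The measured time includes JVM startup, just as it does when running the commands by hand.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for this report; not part of PDFBox.
final class ExtractTimer {

    // Builds the same command line as in the description for a given pdfbox-app version.
    static List<String> buildCommand(String version, String pdf) {
        return Arrays.asList(
                "java", "-jar", "pdfbox-app-" + version + ".jar",
                "ExtractText", pdf, pdf + ".txt");
    }

    // Runs the command to completion and returns the wall-clock time in milliseconds.
    // JVM startup of the child process is included in the measurement.
    static long timeMillis(List<String> command) throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + command);
        }
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: both jars and 074031.pdf are in the working directory.
        for (String version : new String[] {"1.8.12", "2.0.2"}) {
            long ms = timeMillis(buildCommand(version, "074031.pdf"));
            System.out.println(version + ": " + ms + " ms");
        }
    }
}
```

      Running each version several times and comparing the typical figure would reduce the effect of disk caching on the first run.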

      Attachments

        1. 074031.pdf, 109 kB, Andy McMullan

            People

              Assignee: Unassigned
              Reporter: Andy McMullan
              Votes: 0
              Watchers: 4