Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version: 2.0.2
Environment: Windows 2012 R2, Oracle Java 1.8.0_91 64-bit
Description
While testing Tika 1.13 we've hit a performance problem with PDF conversion to text for certain files from the GovDocs corpus.
I have attached an example PDF that shows the problem. It can be reproduced with pdfbox-app.jar alone, without Tika. Running the extraction with PDFBox 1.8.12 takes around 1 second:
java -jar pdfbox-app-1.8.12.jar ExtractText 074031.pdf 074031.pdf.txt
Doing the same with 2.0.2 takes around 30 seconds:
java -jar pdfbox-app-2.0.2.jar ExtractText 074031.pdf 074031.pdf.txt
This is a small PDF, so taking 30 seconds, roughly 30 times longer than with 1.8.12, seems excessive.
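To put numbers on the difference, the two runs above could be timed with a small harness along these lines. This is only a sketch: the class name ExtractTimer and its helper methods are hypothetical, and it assumes both jars and 074031.pdf sit in the working directory and that java is on the PATH. The measured time includes JVM startup, so it is a rough wall-clock figure rather than a precise benchmark.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tika.
public class ExtractTimer {

    // Builds the same ExtractText command line used in this report for a given pdfbox-app jar.
    static List<String> buildCommand(String jar, String pdf) {
        return Arrays.asList("java", "-jar", jar, "ExtractText", pdf, pdf + ".txt");
    }

    // Runs the command and returns the wall-clock time in milliseconds (JVM startup included).
    static long timeCommand(List<String> command) throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + command);
        }
        return elapsedMs;
    }

    public static void main(String[] args) throws Exception {
        // Assumed file names, taken from the commands in this report.
        String pdf = "074031.pdf";
        for (String jar : new String[] {"pdfbox-app-1.8.12.jar", "pdfbox-app-2.0.2.jar"}) {
            long ms = timeCommand(buildCommand(jar, pdf));
            System.out.println(jar + ": " + ms + " ms");
        }
    }
}
```

Both runs write to the same output file, so the second run overwrites the text produced by the first.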
Attachments
Issue Links
- is duplicated by PDFBOX-3442 "OOM for single page pdf file" (Closed)