[PDFBOX-1305] Text extraction takes huge amount of time on some files - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.7.0
Component/s: Text extraction
Labels: None
Environment: Same behavior on Windows 7, Solaris 10, and CentOS 5.7. Same result with JDK 7u4 and JDK 6u32.

Description

I have 1.2M single-page PDF files that I'm indexing with Solr (which uses Tika, which uses PDFBox), and some of them take between 20 minutes and an hour to index.
This is a huge problem for me: in 48 hours I've indexed about 45k files, and 19 of those hours were spent on just 279 files.
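
To put those figures in proportion, here is a small sketch of the arithmetic. It uses only the numbers quoted above; "about 45k" is a rounded figure, so the results are approximate, and the class and method names are made up for illustration.

```java
public class IndexingTimeBreakdown {
    // Figures quoted in the report; TOTAL_FILES is the rounded "about 45k".
    static final int TOTAL_FILES = 45_000;
    static final int SLOW_FILES = 279;
    static final double TOTAL_HOURS = 48.0;
    static final double SLOW_HOURS = 19.0;

    /** Percentage of indexed files that fall into the slow group. */
    static double slowFileSharePercent() {
        return 100.0 * SLOW_FILES / TOTAL_FILES;
    }

    /** Percentage of the total indexing time spent on the slow group. */
    static double slowTimeSharePercent() {
        return 100.0 * SLOW_HOURS / TOTAL_HOURS;
    }

    /** Average minutes spent per slow file. */
    static double avgMinutesPerSlowFile() {
        return SLOW_HOURS * 60.0 / SLOW_FILES;
    }

    /** Average seconds spent per file outside the slow group. */
    static double avgSecondsPerNormalFile() {
        return (TOTAL_HOURS - SLOW_HOURS) * 3600.0 / (TOTAL_FILES - SLOW_FILES);
    }

    public static void main(String[] args) {
        System.out.printf("slow files: %.2f%% of files, %.1f%% of time%n",
                slowFileSharePercent(), slowTimeSharePercent());
        System.out.printf("average per slow file: %.1f min%n", avgMinutesPerSlowFile());
        System.out.printf("average per normal file: %.1f s%n", avgSecondsPerNormalFile());
    }
}
```

With these rounded inputs, well under 1% of the files account for roughly 40% of the indexing time.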

I've traced it to PDFBox taking a lot of time extracting the text from the documents.

I've tested extracting the text using pdfbox-app's ExtractText with the same result: the text is extracted, but it takes forever.

The attached file took about 23 minutes (using ExtractText), and the output contains a lot of "rubbish text" that I don't see in text extracted from files that take a normal amount of time (up to a few seconds per file) to parse.
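
For anyone who wants to find such files in their own collection, a minimal timing sketch follows. It is not PDFBox code: the `ExtractionTimer` class, the `Timed` result type, and the threshold check are hypothetical helpers, and the `Callable` in `main` is a stand-in workload that would be replaced by whatever extraction call is actually in use.

```java
import java.util.concurrent.Callable;

public class ExtractionTimer {

    /** Result of one timed run: the extracted text and the elapsed wall-clock time. */
    public static final class Timed {
        public final String text;
        public final long elapsedMillis;

        Timed(String text, long elapsedMillis) {
            this.text = text;
            this.elapsedMillis = elapsedMillis;
        }

        /** True if the run took longer than the given threshold in milliseconds. */
        public boolean slowerThan(long thresholdMillis) {
            return elapsedMillis > thresholdMillis;
        }
    }

    /** Runs the extractor once and measures how long it takes. */
    public static Timed time(Callable<String> extractor) throws Exception {
        long start = System.nanoTime();
        String text = extractor.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Timed(text, elapsedMillis);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload (assumption): replace with the real extraction call.
        Timed result = time(() -> {
            Thread.sleep(50);
            return "dummy text";
        });
        // Flag anything slower than a few seconds, the "normal" range in the report.
        long thresholdMillis = 5_000;
        System.out.println("elapsed ms: " + result.elapsedMillis
                + ", slow: " + result.slowerThan(thresholdMillis));
    }
}
```

Logging the file name alongside each slow result would give a list of candidate files to attach to a report like this one.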

When running truss on Solaris (strace on Linux) against the Java process, I can see a lot of SEGV signals due to FLTBOUNDS. I don't know whether this is related to the problem, but I want to mention it.

Attachments

20020101ab3x012a.pdf (1.55 MB), attached 09/May/12 14:59 by Roger Håkansson

People

Assignee: Andreas Lehmkühler
Reporter: Roger Håkansson
Votes: 0
Watchers: 1

Dates

Created: 09/May/12 14:58
Updated: 07/May/13 19:05
Resolved: 07/May/13 19:05
