[PDFBOX-956] Poor text extraction performance in PDFTextStripper.java - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.0
Fix Version/s: 1.7.0
Component/s: Text extraction
Labels: None

Description

The worst-case performance of the suppressDuplicateOverlappingText logic in processTextPosition is O(n^2).
The patch uses a TreeMap to achieve O(n log n) performance.
With the example PDF, text extraction took over 2 hours before this patch and less than 10 minutes after.
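
To illustrate the idea, here is a minimal sketch of duplicate suppression using sorted range lookups instead of a linear scan over all earlier characters. It is not the attached patch: the class DuplicateGlyphFilter, the method accept, and the caller-supplied tolerance are hypothetical names and assumptions chosen for this example. Each lookup costs roughly O(log n) as long as few earlier positions fall inside the tolerance window.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper (not part of PDFBox): remembers where each glyph string was drawn. */
class DuplicateGlyphFilter {
    // glyph text -> (x position -> sorted set of y positions) for every glyph kept so far
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen = new HashMap<>();

    /**
     * Returns true if the glyph should be kept, false if an identical glyph
     * was already kept within +/- tolerance in both x and y.
     * Assumes tolerance >= 0.
     */
    boolean accept(String glyph, float x, float y, float tolerance) {
        TreeMap<Float, TreeSet<Float>> byX =
                seen.computeIfAbsent(glyph, k -> new TreeMap<>());

        // Range query: only x positions inside the tolerance window are visited.
        for (TreeSet<Float> ys
                : byX.subMap(x - tolerance, true, x + tolerance, true).values()) {
            // Second range query on y for each nearby x.
            if (!ys.subSet(y - tolerance, true, y + tolerance, true).isEmpty()) {
                return false; // same glyph at (almost) the same place: suppress it
            }
        }

        // No overlap found, so record this position and keep the glyph.
        byX.computeIfAbsent(x, k -> new TreeSet<>()).add(y);
        return true;
    }
}
```

A caller would invoke accept once per text position and drop the character whenever it returns false. How the tolerance is derived (for example from the glyph width) is left open here.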

BTW: The extracted text is also quite different from what Adobe Reader produces. I'm not sure which is correct, but for this document it doesn't matter.

Attachments

- PDFTextStripper.java.patch, 05/Feb/11 02:38, 4 kB, Kevin Jackson
- c4ce2fcd_69.pdf, 11/Feb/11 04:06, 2.85 MB, Kevin Jackson
- PDFBOX956-c4ce2fcd_69.txt, 12/Feb/11 18:30, 611 kB, Andreas Lehmkühler
- PDFTextStripper.pdf, 14/Mar/11 14:19, 29 kB, Stefan Magnus Landrø

Issue Links

is required by: PDFBOX-895 Infinite recursion when trying to extract text from specific types of PDFs (Closed)

People

Assignee: Andreas Lehmkühler

Reporter: Kevin Jackson

Votes: 0

Watchers: 1

Dates

Created: 05/Feb/11 02:35

Updated: 29/May/12 16:21

Resolved: 09/Nov/11 07:13