Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 1.4.0
- None
Description
The worst-case performance of the suppressDuplicateOverlappingText logic in processTextPosition is O(n^2).
The patch uses a TreeMap to achieve O(n log n) performance.
The example PDF took over 2 hours to extract the text before this patch and less than 10 minutes after.
BTW: The extracted text is also quite different from what Adobe Reader produces. I'm not sure which is correct, but for this document it doesn't matter.
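To illustrate the idea behind the patch, here is a minimal sketch of duplicate suppression backed by sorted collections. It is not the committed code: the class name DuplicateTextFilter, the accept method and its explicit tolerance parameter are all hypothetical, chosen only for this example. Instead of comparing each new character against every previously seen character, positions are kept sorted per text string so that nearby candidates can be found with a range lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of PDFBox.
class DuplicateTextFilter {
    // text -> (x -> sorted set of y values already seen at that x)
    private final Map<String, TreeMap<Float, TreeSet<Float>>> positionsByText =
            new HashMap<>();

    /**
     * Returns true and records the position if no identical text was seen
     * within the tolerance window around (x, y); returns false for a
     * suspected duplicate. The tolerance is assumed to be non-negative.
     */
    boolean accept(String text, float x, float y, float tolerance) {
        TreeMap<Float, TreeSet<Float>> byX =
                positionsByText.computeIfAbsent(text, k -> new TreeMap<>());

        // Range lookup: only x values in [x - tolerance, x + tolerance) are visited.
        SortedMap<Float, TreeSet<Float>> nearX =
                byX.subMap(x - tolerance, x + tolerance);
        for (TreeSet<Float> ys : nearX.values()) {
            // Same range lookup on the y axis.
            if (!ys.subSet(y - tolerance, y + tolerance).isEmpty()) {
                return false; // overlapping duplicate, suppress it
            }
        }

        // Not a duplicate: remember this position for later lookups.
        byX.computeIfAbsent(x, k -> new TreeSet<>()).add(y);
        return true;
    }
}
```

Each lookup and insertion costs O(log n) plus the number of entries that fall inside the tolerance window, which is normally small, so n characters are processed in roughly O(n log n) rather than O(n^2). A caller would skip any character for which accept returns false.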
Attachments
Issue Links
- is required by PDFBOX-895 Infinite recursion when trying to extract text from specific types of PDFs (Closed)