[PDFBOX-1153] Use dictionary lookups to increase text extraction accuracy - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

There are still some cases where the text extraction code incorrectly inserts spaces inside words extracted from a PDF document. We could increase extraction accuracy with an optional dictionary lookup mechanism that checks each extracted word or token against a dictionary of common words. If the lookup fails (and the amount of empty space after the token is small), the token is concatenated with the next one. If that concatenated token matches a word in the dictionary, the intervening space can very likely be dropped.
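
A minimal sketch of the proposed lookup, for illustration only. It is not part of PDFBox: the `DictionarySpaceMerger` class, its `Token` type, the `gapAfter` field, and the `maxGap` threshold are all hypothetical names, and the sketch assumes the extractor can report the width of the gap that follows each token. It only tries to join a token with its immediate successor, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a PDFBox class.
class DictionarySpaceMerger {

    /** Hypothetical token: the extracted text plus the width of the gap after it. */
    static final class Token {
        final String text;
        final float gapAfter;

        Token(String text, float gapAfter) {
            this.text = text;
            this.gapAfter = gapAfter;
        }
    }

    /**
     * Joins a token with the next one when the token is not a dictionary word,
     * the gap after it is at most maxGap, and the joined form is a dictionary word.
     * The dictionary is assumed to hold lower-case entries.
     */
    static List<String> merge(List<Token> tokens, Set<String> dictionary, float maxGap) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token current = tokens.get(i);
            boolean known = dictionary.contains(current.text.toLowerCase(Locale.ROOT));
            if (!known && current.gapAfter <= maxGap && i + 1 < tokens.size()) {
                String joined = current.text + tokens.get(i + 1).text;
                if (dictionary.contains(joined.toLowerCase(Locale.ROOT))) {
                    out.add(joined); // drop the intervening space
                    i += 2;          // the next token has been consumed
                    continue;
                }
            }
            out.add(current.text);   // keep the token unchanged
            i++;
        }
        return out;
    }
}
```

With a dictionary containing "extraction" and "works", the tokens "ex" (small gap), "traction", "works" would come out as "extraction", "works"; with a gap wider than `maxGap` after "ex", the tokens are left apart. A real implementation would also need to decide how to handle punctuation, hyphenation, and words missing from the dictionary.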

People

Assignee: Unassigned

Reporter: Jukka Zitting

Votes: 0

Watchers: 1

Dates

Created: 31/Oct/11 16:56

Updated: 17/Jun/14 20:19

Resolved: 17/Jun/14 20:19