Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Won't Fix
Description
There are still some cases where the text extraction code incorrectly inserts spaces inside words extracted from a PDF document. We could increase extraction accuracy with an optional dictionary lookup mechanism that checks each extracted word or token against a dictionary of common words. If the lookup fails (and the amount of empty space after the token is small), the token is concatenated with the next one. If that concatenated token matches a word in the dictionary, the intervening space can very likely be dropped.
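The proposed lookup could be sketched roughly as follows. This is only an illustration of the idea in the description, not the extraction code itself: the `Token` type, its `gap_after` field, the `merge_split_words` function, and the `max_gap` threshold are all hypothetical names and values chosen for the example, and the dictionary is just a set of lowercase words.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    gap_after: float  # hypothetical: width of the empty space after this token


def merge_split_words(tokens, dictionary, max_gap=1.0):
    """Join a token with its successor when the description's conditions hold.

    A pair is merged only if the first token is not in the dictionary, the gap
    after it is at most max_gap (an assumed threshold), and the concatenation
    is in the dictionary.
    """
    merged = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        has_next = i + 1 < len(tokens)
        if (has_next
                and current.text.lower() not in dictionary
                and current.gap_after <= max_gap):
            candidate = current.text + tokens[i + 1].text
            if candidate.lower() in dictionary:
                # Drop the intervening space; keep the gap of the second token.
                merged.append(Token(candidate, tokens[i + 1].gap_after))
                i += 2
                continue
        merged.append(current)
        i += 1
    return merged


# Example: "ex traction code" where the first gap is suspiciously small.
words = {"extraction", "code"}
tokens = [Token("ex", 0.4), Token("traction", 3.0), Token("code", 0.0)]
print([t.text for t in merge_split_words(tokens, words)])
# ['extraction', 'code']
```

The sketch only merges adjacent pairs in a single pass, so a word broken into three or more pieces would need repeated passes, and it does not strip punctuation before the lookup. Because a token that is already a dictionary word is never merged, a split such as "in put" would be left alone, which is one reason the mechanism would have to stay optional.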