PDFBox / PDFBOX-1153

Use dictionary lookups to increase text extraction accuracy

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      There are still some cases where the text extraction code incorrectly inserts spaces inside words extracted from a PDF document. We could increase extraction accuracy with an optional dictionary lookup mechanism that checks each extracted word or token against a dictionary of common words. If the lookup fails (and the amount of empty space after the token is small), the token is concatenated with the next one. If that concatenated token matches a word in the dictionary, the intervening space can very likely be dropped.
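      The proposed heuristic could be sketched roughly as follows. This is only an illustration of the idea in the description, not code from PDFBox: the class `DictionaryMerger`, the method `mergeSplitWords`, the `gapAfter` array (width of the space following each token) and the `maxGap` threshold are all hypothetical names chosen for this example, and the sketch assumes the dictionary holds lower-case words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the PDFBox API.
final class DictionaryMerger {

    private DictionaryMerger() {
    }

    // tokens:     extracted words in reading order
    // gapAfter:   gapAfter[i] is the width of the space after tokens.get(i)
    //             (assumed to have the same length as tokens)
    // dictionary: lower-case words considered valid
    // maxGap:     largest gap that may still be treated as a spurious space
    static List<String> mergeSplitWords(List<String> tokens, float[] gapAfter,
                                        Set<String> dictionary, float maxGap) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            // Only try a merge when the lookup fails and the following gap is small.
            if (hasNext && !isKnown(token, dictionary) && gapAfter[i] <= maxGap) {
                String joined = token + tokens.get(i + 1);
                if (isKnown(joined, dictionary)) {
                    // The concatenation is a dictionary word: drop the space.
                    result.add(joined);
                    i += 2;
                    continue;
                }
            }
            result.add(token);
            i++;
        }
        return result;
    }

    private static boolean isKnown(String word, Set<String> dictionary) {
        return dictionary.contains(word.toLowerCase(Locale.ROOT));
    }
}
```

      In this sketch a token is merged only with its immediate successor, so a word split into three or more fragments would not be repaired; handling that case would need a further design decision that the description does not cover.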

          People

            Assignee: Unassigned
            Reporter: Jukka Zitting (jukkaz)
            Votes: 0
            Watchers: 1
