Tika / TIKA-3298

Add a "preloadLangs" parameter to TesseractOCRParser

Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels: None

    Description

      Peter Kronenberg, on the user/dev lists and on TIKA-3297 and TIKA-3296, has observed that the tesseract error message for "lang data doesn't exist" is not very clear. We could add a "preloadLangs" option to TesseractOCRParser (default: false). If it is set to true and the parser finds tesseract during initialization, the parser will call tesseract --list-langs and store the languages it reports. At parse time, if that set is non-empty, TesseractOCRParser will check the user-requested language against it and throw a clearer exception telling the user that the language data for the requested language doesn't exist.
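
      The proposal above could look roughly like the following. This is only a minimal sketch, not the code that was committed to Tika: the class LangPreloadSketch and its methods are hypothetical names, IllegalArgumentException stands in for whatever exception type the parser would really throw, and it assumes that the header line printed by tesseract --list-langs starts with "List of" and that every other non-blank line is one language code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Tika.
class LangPreloadSketch {

    // Turn the lines printed by `tesseract --list-langs` into a set of language codes.
    // Assumption: the header line starts with "List of"; all other non-blank lines are codes.
    static Set<String> parseListLangs(List<String> lines) {
        Set<String> langs = new LinkedHashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("List of")) {
                continue;
            }
            langs.add(trimmed);
        }
        return langs;
    }

    // Run once at initialization, and only if tesseract was found.
    static Set<String> preloadLangs(String tesseractPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(tesseractPath, "--list-langs");
        // Merge stderr into stdout so the list is captured whichever stream it is written to.
        pb.redirectErrorStream(true);
        Process process = pb.start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return parseListLangs(lines);
    }

    // Run at parse time. An empty set means nothing was preloaded, so no check is made.
    static void checkLangs(Set<String> available, String requested) {
        if (available.isEmpty()) {
            return;
        }
        Set<String> missing = new LinkedHashSet<>();
        // tesseract accepts several languages joined with '+', e.g. "eng+fra".
        for (String lang : requested.split("\\+")) {
            if (!lang.isEmpty() && !available.contains(lang)) {
                missing.add(lang);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "tesseract has no language data for " + missing
                            + "; available languages: " + available);
        }
    }
}
```

      With preloadLangs left at false the set stays empty, so checkLangs does nothing and tesseract's own error message is what the user sees, as before.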

People

    • Assignee: Unassigned
    • Reporter: Tim Allison (tallison)
    • Votes: 0
    • Watchers: 3