Tika / TIKA-3298

Add a "preloadLangs" parameter to TesseractOCRParser

Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels: None

    Description

      peterkronenberg, on the user/dev lists and on TIKA-3297 and TIKA-3296, has observed that the tesseract error message for "lang data doesn't exist" is not very clear.  We could add a "preloadLangs" option to TesseractOCRParser (default: false).  If it is set to true and the parser finds tesseract at initialization, the parser will call tesseract --list-langs and store the languages it reports.  At parse time, if that set is non-empty, the TesseractOCRParser will check the user-requested language against it and, if the language data is missing, throw a clearer exception saying that the language data doesn't exist for the requested language.
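
      A minimal sketch of the proposed flow, for illustration only; it is not the actual TesseractOCRParser code. The class name PreloadLangsSketch, its method names, and the use of IllegalArgumentException are all hypothetical stand-ins. It assumes that tesseract --list-langs prints a header line beginning with "List of available languages" followed by one language per line, and that a request may combine several languages with "+" (as in eng+fra).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the "preloadLangs" idea; names are illustrative only.
class PreloadLangsSketch {

    // Runs "<tesseractPath> --list-langs" and returns its raw output.
    static String runListLangs(String tesseractPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(tesseractPath, "--list-langs");
        // Merge stderr into stdout so the process cannot block on an unread
        // stream. Any warning lines would then look like language entries;
        // a real implementation would need to filter them.
        pb.redirectErrorStream(true);
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }

    // Parses the output into a set, skipping blank lines and the header
    // line (assumed to start with "List of available languages").
    static Set<String> parseListLangs(String output) {
        Set<String> langs = new LinkedHashSet<>();
        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()
                    || trimmed.startsWith("List of available languages")) {
                continue;
            }
            langs.add(trimmed);
        }
        return Collections.unmodifiableSet(langs);
    }

    // Parse-time check. An empty set means nothing was preloaded,
    // so the check is skipped, matching the proposal above.
    static void checkLangs(Set<String> available, String requested) {
        if (available.isEmpty()) {
            return;
        }
        for (String lang : requested.split("\\+")) {
            if (!available.contains(lang)) {
                // Stand-in exception type; the real parser may use another.
                throw new IllegalArgumentException(
                        "tesseract language data not found for '" + lang
                        + "'; available languages: " + available);
            }
        }
    }
}
```

      With a set preloaded from runListLangs at initialization, a call such as checkLangs(langs, "eng+deu") would fail fast with a message naming the missing language, rather than surfacing tesseract's own error at OCR time.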

      Attachments

        1. image-2021-02-11-08-56-38-712.png
          50 kB
          Peter Kronenberg
        2. image-2021-02-10-19-00-10-691.png
          45 kB
          Peter Kronenberg
        3. image-2021-02-10-18-59-47-793.png
          17 kB
          Peter Kronenberg

          People

            Assignee: Unassigned
            Reporter: Tim Allison (tallison)
            Votes: 0
            Watchers: 3