Tika / TIKA-2231

Invalid language code exception

Details

    Description

      TesseractOCRConfig.setLanguage(String language) uses a regex to validate the language being set. Unfortunately, the regex rejects some language codes that are valid for Tesseract.

      For example:

      TesseractOCRConfig config = new TesseractOCRConfig();
      config.setLanguage("chi_tra");

      This throws an IllegalArgumentException because of the '_' in the language name, even though "chi_tra" (Traditional Chinese) is a valid Tesseract language code.

      The regex needs to be updated to allow the '_' character.
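
      One possible shape for the relaxed check is sketched below. It is only an illustration: the class name LanguageCodeCheck, its methods, and the pattern itself are assumptions made for this example, not Tika's actual code or the fix that was eventually applied. The pattern accepts letter-only codes with optional '_'-separated suffixes, joined by '+' (the separator Tesseract uses to combine languages). It does not try to cover every name Tesseract may accept.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the validation inside
// TesseractOCRConfig.setLanguage; the pattern is an assumption,
// not the regex Tika actually uses.
class LanguageCodeCheck {

    // One or more codes joined by '+'. Each code is a run of letters,
    // optionally followed by '_'-separated letter runs,
    // e.g. "eng", "chi_tra", "eng+chi_tra".
    private static final Pattern LANGUAGE = Pattern.compile(
            "[A-Za-z]+(_[A-Za-z]+)*(\\+[A-Za-z]+(_[A-Za-z]+)*)*");

    static boolean isValid(String language) {
        return language != null && LANGUAGE.matcher(language).matches();
    }

    // Mirrors the behaviour described in the report: reject invalid
    // input with an IllegalArgumentException.
    static String requireValid(String language) {
        if (!isValid(language)) {
            throw new IllegalArgumentException(
                    "Invalid language code: " + language);
        }
        return language;
    }

    public static void main(String[] args) {
        System.out.println(isValid("chi_tra"));
        System.out.println(isValid("eng+chi_tra"));
        System.out.println(isValid("chi tra"));
    }
}
```

      With a pattern like this, "chi_tra" and combinations such as "eng+chi_tra" pass, while input containing spaces or a leading or trailing '_' or '+' is still rejected.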

          People

            Assignee: Unassigned
            Reporter: Peter Weiss (pmweiss5)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Original Estimate: 1h
                Remaining Estimate: 1h
                Time Spent: Not Specified
