Tika / TIKA-2231

Invalid language code exception

Details

    Description

      TesseractOCRConfig.setLanguage(String language) uses a regex to validate the language being set. Unfortunately, that regex rejects some language codes that are valid for Tesseract.

      For example:

      TesseractOCRConfig config = new TesseractOCRConfig();
      config.setLanguage("chi_tra");

      This throws an IllegalArgumentException because of the '_' in the language name, even though "chi_tra" is a valid Tesseract language code.

      The regex needs to be updated to allow the '_' character.
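
A minimal sketch of what a more permissive check could look like. The pattern below is an assumption made for illustration and is not Tika's actual regex; the class name LanguageCodeCheck and the method isValidLanguage are hypothetical. It accepts letters and '_' in each language code, and also allows several codes joined with '+', which is how Tesseract's -l option combines languages.

```java
import java.util.regex.Pattern;

class LanguageCodeCheck {
    // Hypothetical pattern (not taken from Tika): one or more codes made of
    // ASCII letters and '_', separated by single '+' characters.
    private static final Pattern LANGUAGE =
            Pattern.compile("[A-Za-z_]+(\\+[A-Za-z_]+)*");

    // Returns true only if the whole string matches the pattern.
    static boolean isValidLanguage(String language) {
        return language != null && LANGUAGE.matcher(language).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidLanguage("chi_tra")); // true
        System.out.println(isValidLanguage("eng+fra")); // true
        System.out.println(isValidLanguage("chi-tra")); // false
        System.out.println(isValidLanguage("eng+"));    // false
    }
}
```

A setter built on a check like this would still throw IllegalArgumentException for malformed input, but would no longer reject underscore-style codes such as "chi_tra".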

            People

              Assignee: Unassigned
              Reporter: Peter Weiss (pmweiss5)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Original Estimate: 1h
                  Remaining Estimate: 1h
                  Time Spent: Not Specified