Description
TesseractOCRConfig.setLanguage(String language) uses a regex to validate the language being set. Unfortunately, the regex rejects some language codes that are valid for Tesseract.
For example:
TesseractOCRConfig config = new TesseractOCRConfig();
config.setLanguage("chi_tra");
This throws an IllegalArgumentException because of the '_' in the language name, even though "chi_tra" is a valid Tesseract language code.
The regex needs to be updated to allow the '_' character.
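As a rough illustration of the kind of check that would accept such codes, here is a minimal standalone sketch. The class name LanguageCodeCheck, the method isValidLanguage, and the pattern itself are hypothetical and are not taken from Tika's source. The pattern assumes a language code is made of letter groups joined by '_', and that several codes may be combined with '+' (as in "eng+chi_tra").

```java
import java.util.regex.Pattern;

public class LanguageCodeCheck {
    // Assumed pattern (NOT Tika's actual regex): one or more codes joined by '+',
    // where each code is a run of letters optionally followed by '_'-separated
    // runs of letters, e.g. "eng", "chi_tra", "eng+chi_tra".
    private static final Pattern LANGUAGE =
            Pattern.compile("[A-Za-z]+(_[A-Za-z]+)*(\\+[A-Za-z]+(_[A-Za-z]+)*)*");

    // Returns true if the whole string matches the assumed language pattern.
    public static boolean isValidLanguage(String language) {
        return language != null && LANGUAGE.matcher(language).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidLanguage("chi_tra"));     // true
        System.out.println(isValidLanguage("eng+chi_tra")); // true
        System.out.println(isValidLanguage("chi tra"));     // false (space is rejected)
    }
}
```

The check stays strict about every character other than letters, '_' and '+', so whitespace and shell metacharacters are still rejected. That appears to be the reason for validating the value in the first place.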
Issue Links
- relates to TIKA-3207 Invalid language code in TesseractOCRConfig (Resolved)