Description
Some language packs available on Tesseract's GitHub support vertical orientations of Chinese (chi_sim_vert and chi_tra_vert). Specifying them via TesseractOCRConfig.setLanguage(String language) throws an IllegalArgumentException because the validation regex allows at most one underscore-separated suffix per language code, so the second underscore in these names is rejected.
/**
 * Set tesseract language dictionary to be used. Default is "eng".
 * Multiple languages may be specified, separated by plus characters.
 * e.g. "chi_tra+chi_sim"
 */
public void setLanguage(String language) {
    if (!language.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4})?(\\+?))+") || language.endsWith("+")) {
        throw new IllegalArgumentException("Invalid language code");
    }
    this.language = language;
}
What is the reason behind validating language options?
Either way, I'd be more than happy to supply a patch. Thank you.
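For illustration, here is a rough sketch of what a relaxed check might look like. It is not the actual Tika fix. The LanguageValidator class and isValid method are hypothetical names, and the limit of two underscore suffixes of 3 to 4 letters each is an assumption based only on the chi_sim_vert and chi_tra_vert names mentioned above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: a relaxed variant of the check in setLanguage.
final class LanguageValidator {
    // One language code: three letters followed by up to two "_xxx" or "_xxxx"
    // suffixes. The upper bound of two suffixes is an assumption.
    private static final String SINGLE = "[a-zA-Z]{3}(_[a-zA-Z]{3,4}){0,2}";

    // One or more language codes joined by '+'. Requiring a code after every '+'
    // rejects a trailing '+' without a separate endsWith check.
    private static final Pattern LANGUAGES =
            Pattern.compile(SINGLE + "(\\+" + SINGLE + ")*");

    private LanguageValidator() {
    }

    static boolean isValid(String language) {
        return language != null && LANGUAGES.matcher(language).matches();
    }
}
```

With this pattern, "chi_sim_vert" and "chi_sim_vert+chi_tra_vert" are accepted, while "eng+" and "chi__sim" are still rejected. setLanguage could call such a helper and keep throwing IllegalArgumentException on failure.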
Attachments
Issue Links
- is related to TIKA-2231 Invalid language code exception (Resolved)