[TIKA-3207] Invalid language code in TesseractOCRConfig - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.24.1
Fix Version/s: 1.25
Component/s: ocr
Labels: None

Description

Some language packs available on Tesseract's GitHub support vertical text orientation for Chinese (chi_sim_vert and chi_tra_vert). Specifying them via TesseractOCRConfig.setLanguage(String language) throws an IllegalArgumentException because the validation regex does not allow a second underscore in the name.

    /**
     * Set tesseract language dictionary to be used. Default is "eng".
     * Multiple languages may be specified, separated by plus characters.
     * e.g. "chi_tra+chi_sim"
     */
    public void setLanguage(String language) {
        if (!language.matches("([a-zA-Z]{3}(_[a-zA-Z]{3,4})?(\\+?))+")
                || language.endsWith("+")) {
            throw new IllegalArgumentException("Invalid language code");
        }
        this.language = language;
    }
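
To illustrate the failure, here is a minimal sketch that applies the pattern quoted above to a few language strings and compares it with one possible relaxed pattern. The class name `LanguageCodeCheck`, its helper methods, and the `RELAXED` pattern are hypothetical and are not necessarily the change that went into 1.25.

```java
import java.util.regex.Pattern;

public class LanguageCodeCheck {

    // Pattern copied from TesseractOCRConfig.setLanguage as quoted in this issue.
    static final Pattern CURRENT =
            Pattern.compile("([a-zA-Z]{3}(_[a-zA-Z]{3,4})?(\\+?))+");

    // Hypothetical relaxed pattern (an assumption, not the committed fix):
    // each code is three letters followed by any number of "_xxx" or "_xxxx"
    // suffixes, and codes are joined by single plus characters.
    static final Pattern RELAXED = Pattern.compile(
            "[a-zA-Z]{3}(_[a-zA-Z]{3,4})*(\\+[a-zA-Z]{3}(_[a-zA-Z]{3,4})*)*");

    // Mirrors the check in setLanguage: full regex match and no trailing '+'.
    static boolean acceptedByCurrent(String language) {
        return CURRENT.matcher(language).matches() && !language.endsWith("+");
    }

    // The relaxed pattern already rules out a trailing '+', so no extra check.
    static boolean acceptedByRelaxed(String language) {
        return RELAXED.matcher(language).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"eng", "chi_tra+chi_sim", "chi_sim_vert", "eng+"};
        for (String s : samples) {
            System.out.println(s + " current=" + acceptedByCurrent(s)
                    + " relaxed=" + acceptedByRelaxed(s));
        }
    }
}
```

With the quoted pattern, "chi_sim_vert" is rejected because only one underscore suffix is allowed per code; the relaxed sketch accepts it while still rejecting a trailing plus.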

What is the reason behind validating language options?

Either way, I'd be more than happy to supply a patch. Thank you.

Attachments

tesseract_exe.PNG
08/Oct/20 22:15
14 kB
Daniel Smyda

Issue Links

is related to: TIKA-2231 Invalid language code exception (Resolved)

People

Assignee: Tim Allison

Reporter: Daniel Smyda

Votes: 0

Watchers: 3

Dates

Created: 08/Oct/20 21:52

Updated: 09/Oct/20 16:39

Resolved: 09/Oct/20 15:56