[TIKA-3286] Tika does not issue an error when language file doesn't exist; not supporting script files - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: None
Labels: None

Description

Tika uses a regular expression to validate the language string, assuming it is a set of ISO 639-2 language codes separated by plus signs. However, script files (in the script directory) can have arbitrary names, the only rule being that they start with a capital letter. The scripts were introduced in 4.0.0; see https://github.com/manisandro/gImageReader/issues/323
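To illustrate the mismatch, here is a minimal sketch. The pattern below is a hypothetical stand-in for this kind of check (three-letter codes joined by plus signs), not Tika's actual regular expression, and the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class LanguagePatternDemo {
    // Hypothetical stand-in for the validation regex (NOT Tika's real pattern):
    // lowercase three-letter codes joined by '+'.
    static final Pattern ISO_CODES = Pattern.compile("[a-z]{3}(\\+[a-z]{3})*");

    static boolean isAccepted(String language) {
        return ISO_CODES.matcher(language).matches();
    }

    public static void main(String[] args) {
        // Language codes pass; script names such as "Latin" or "script/Latin" do not.
        for (String lang : new String[] {"eng", "eng+fra", "Latin", "script/Latin"}) {
            System.out.println(lang + " -> " + isAccepted(lang));
        }
    }
}
```

Any pattern built around three-letter codes rejects capitalized script names of arbitrary length, which is the limitation described above.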

In addition, if the user specifies an invalid language (i.e., the string matches the regular expression, but there is no corresponding language file in tessdata), no error message is issued. Tesseract issues some very ugly and misleading messages, which simply assume that you haven't set the tessdata directory correctly, but Tika does not capture them (and I'm not sure they would be appropriate anyway). Tika just blindly calls Tesseract and then gets no output back.

I suggest splitting the language string on the plus sign and doing no other validation of the string; instead, actually check that each file exists in either tessdata or tessdata/script.

If any of them doesn't exist, throw an exception, similar to what is done now when the language doesn't match the regular expression.
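The proposal could be sketched roughly as follows. This is only an illustration, not the prototype itself: the class and method names are hypothetical, the choice of IllegalArgumentException is an assumption, and it assumes the data files are named with the .traineddata extension.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TessdataLanguageCheck {
    // Returns every requested name that has no data file in tessdata or tessdata/script.
    static List<String> findMissing(String language, Path tessdata) {
        List<String> missing = new ArrayList<>();
        for (String name : language.split("\\+")) {
            String file = name + ".traineddata"; // assumed file naming
            boolean found = Files.isRegularFile(tessdata.resolve(file))
                    || Files.isRegularFile(tessdata.resolve("script").resolve(file));
            if (!found) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Throws if any requested name is missing (the exception type is an assumption).
    static void validate(String language, Path tessdata) {
        List<String> missing = findMissing(language, tessdata);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "No traineddata file for " + missing + " under " + tessdata);
        }
    }

    public static void main(String[] args) {
        Path tessdata = Paths.get(args.length > 0 ? args[0] : "tessdata");
        System.out.println("missing: " + findMissing("eng+fra+Latin", tessdata));
    }
}
```

Reporting all missing names at once, rather than failing on the first, gives the user a complete error message in a single run.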

I've started to prototype this.

Later: I'm trying to clarify how the scripts are intended to be used. The page referenced above as well as https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES imply that the -l option accepts the name of a language or script. I assumed it would look in tessdata first and if not found, would look in tessdata/script. But it seems you have to enter the path.

tesseract --list-langs displays them this way (screenshot attached)

so it clearly knows about the script directory, but it expects the user to know about it as well. I'm not sure whether we want to make Tika friendlier here.
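If a friendlier behavior were wanted, one option would be to rewrite bare script names before calling Tesseract. The sketch below is purely hypothetical (class and method names are invented, and it assumes .traineddata file names): a name found only under tessdata/script gets the script/ prefix, and everything else is passed through unchanged.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ScriptNameResolver {
    // Rewrites each name to "script/<name>" when its data file exists only in tessdata/script.
    static String resolve(String language, Path tessdata) {
        List<String> resolved = new ArrayList<>();
        for (String name : language.split("\\+")) {
            String file = name + ".traineddata"; // assumed file naming
            boolean inTessdata = Files.isRegularFile(tessdata.resolve(file));
            boolean inScript = Files.isRegularFile(tessdata.resolve("script").resolve(file));
            resolved.add(!inTessdata && inScript ? "script/" + name : name);
        }
        return String.join("+", resolved);
    }

    public static void main(String[] args) {
        Path tessdata = Paths.get(args.length > 0 ? args[0] : "tessdata");
        System.out.println(resolve("eng+Latin", tessdata));
    }
}
```

Names that cannot be found anywhere are left as they are, so a separate existence check can still report them.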

Attachments

- script.png (23 kB), 28/Jan/21 18:56, Peter Kronenberg
- Screen Shot 2021-02-04 at 1.13.33 PM.png (217 kB), 04/Feb/21 18:14, Tim Allison
- Screen Shot 2021-02-04 at 1.13.20 PM.png (55 kB), 04/Feb/21 18:14, Tim Allison
- nolang.png (10 kB), 28/Jan/21 18:58, Peter Kronenberg
- list-lang.png (4 kB), 28/Jan/21 18:55, Peter Kronenberg

People

Assignee: Unassigned
Reporter: Peter Kronenberg
Votes: 0
Watchers: 3

Dates

Created: 28/Jan/21 18:55
Updated: 05/Feb/21 00:02
Resolved: 04/Feb/21 19:58