I've been thinking about this a bit more, and I'm attaching what is (in my eyes) an improved patch.
Those who override the config need better logging; otherwise they're in the dark as to what happened.
Since we cannot use a logging framework, I've added a way to detect and fetch errors through the API.
I've also removed the IOException from the constructor's throws clause. That exception can only be thrown if the property file does not exist, and in my opinion that will NEVER happen, so we should not force people to catch it. It is also better to keep backward API compatibility for all the existing applications out there. Users who override the config will not get an IOException either: if their tika.language.override.properties does not exist, Tika will load the built-in one, and if the property file exists but is faulty, no IOException is thrown in that case either. If there are errors in the properties file, you can check for them through the new methods, and even ask for the list of languages that were successfully initialized.
Changes in the attached patch:
- public static boolean hasErrors()
- public static String getErrors()
- public static Set<String> getSupportedLanguages()
- Constructor no longer throws IOException
- Added better comments to public methods, including a warning about isReasonablyCertain() and short texts
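
To illustrate the idea behind the three new static methods, here is a minimal, self-contained sketch of the "collect errors instead of throwing" pattern. It is not the code from the patch: the class name LanguageProfileRegistry, the initProfiles(String...) signature, the loadProfile helper and the three bundled language codes are all hypothetical stand-ins. Only the names hasErrors(), getErrors() and getSupportedLanguages() are taken from the list above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the patched class; NOT the actual Tika implementation.
class LanguageProfileRegistry {
    // Assumption: pretend only these profiles ship with the library.
    private static final Set<String> BUNDLED =
            new HashSet<>(Arrays.asList("en", "de", "fr"));

    // Language code -> loaded profile (a plain String here, for illustration only).
    private static final Map<String, String> PROFILES = new LinkedHashMap<>();

    // Accumulated error messages; empty means every requested profile loaded.
    private static final StringBuilder ERRORS = new StringBuilder();

    // Hypothetical loader: fails for any language without a bundled profile.
    private static String loadProfile(String language) throws IOException {
        if (!BUNDLED.contains(language)) {
            throw new IOException("Failed trying to load language profile for language \""
                    + language + "\"");
        }
        return "profile-" + language;
    }

    // Loads every requested profile; failures are recorded, never propagated.
    static void initProfiles(String... languages) {
        PROFILES.clear();
        ERRORS.setLength(0);
        for (String language : languages) {
            try {
                PROFILES.put(language, loadProfile(language));
            } catch (IOException e) {
                ERRORS.append("Language ").append(language)
                      .append(" not initialized. Message: ")
                      .append(e.getMessage()).append('\n');
            }
        }
    }

    static boolean hasErrors() {
        return ERRORS.length() > 0;
    }

    static String getErrors() {
        return ERRORS.toString();
    }

    // Returns a sorted, read-only view of the languages that loaded successfully.
    static Set<String> getSupportedLanguages() {
        return Collections.unmodifiableSet(new TreeSet<>(PROFILES.keySet()));
    }

    public static void main(String[] args) {
        initProfiles("en", "de", "xx"); // "xx" has no profile in this sketch
        System.out.println("Supported languages: " + getSupportedLanguages());
        System.out.println("Number of languages supported: " + getSupportedLanguages().size());
        System.out.println("Has errors? " + hasErrors());
        System.out.print(getErrors());
    }
}
```

The point of the design is that a faulty entry only costs you that one language: callers that don't care keep working without a try/catch, and callers that do care can ask hasErrors() after initialization.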
Sample output from a test application:
Supported languages: [is, da, it, no, hu, th, de, el, fi, pt, pl, sv, fr, en, ru, et, es, nl]
Number of languages supported: 18
Has errors? true
Language xx (Unknown) not initialized. Message: Failed trying to load language profile for language "xx". Error: null