[TIKA-209] Language detection is weak. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.3
Fix Version/s: 0.5
Component/s: languageidentifier
Labels: None

Description

In org.apache.tika.utils.Utils, the getUTF8Reader method sets the language metadata from the match returned by ICU's CharsetDetector without checking that match's confidence rating.

Please add a configurable confidence threshold (0-100) and set the language only when the match exceeds it:

if (language != null && match.getConfidence() > THRESHOLD) {
    metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
    metadata.set(Metadata.LANGUAGE, match.getLanguage());
}

Inferring the language from the detected charset is inherently weak, but it would be sufficient if the confidence threshold were configurable. Today, for example, the text "hello" is tagged as French.
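
As a rough illustration of the requested behaviour, here is a minimal, self-contained sketch. It is not Tika's actual code: the class LanguageThresholdSketch, its apply method, and the plain Map used in place of Tika's Metadata object are all hypothetical stand-ins, and the language and confidence arguments are assumed to come from an ICU CharsetMatch (getLanguage() and getConfidence()).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; this is not Tika's Utils class. */
final class LanguageThresholdSketch {

    // Stand-in keys for Metadata.CONTENT_LANGUAGE and Metadata.LANGUAGE (assumed values).
    static final String CONTENT_LANGUAGE = "Content-Language";
    static final String LANGUAGE = "language";

    // Configurable confidence threshold on the 0-100 scale requested above.
    private final int threshold;

    LanguageThresholdSketch(int threshold) {
        if (threshold < 0 || threshold > 100) {
            throw new IllegalArgumentException("threshold must be in 0-100: " + threshold);
        }
        this.threshold = threshold;
    }

    /**
     * Records the language only when the detector's confidence strictly
     * exceeds the threshold. Returns true if the metadata was updated.
     */
    boolean apply(Map<String, String> metadata, String language, int confidence) {
        if (language != null && confidence > threshold) {
            metadata.put(CONTENT_LANGUAGE, language);
            metadata.put(LANGUAGE, language);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        LanguageThresholdSketch sketch = new LanguageThresholdSketch(50);
        Map<String, String> metadata = new LinkedHashMap<>();
        sketch.apply(metadata, "fr", 15); // low-confidence guess: ignored
        sketch.apply(metadata, "en", 80); // confident guess: recorded
        System.out.println(metadata);
    }
}
```

With a threshold of 50, a short input such as "hello" that the detector matches with low confidence would leave the language fields unset instead of being tagged as French.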

Issue Links

is related to: TIKA-369 Improve accuracy of language detection (Resolved)

People

Assignee: Jukka Zitting

Reporter: Robert Newson

Votes: 0

Watchers: 0

Dates

Created: 23/Mar/09 11:12

Updated: 13/Jun/10 22:45

Resolved: 07/Nov/09 04:34