[TIKA-369] Improve accuracy of language detection - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.6
Fix Version/s: None
Component/s: languageidentifier
Labels: None

Description

Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:

1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better on short text, which would also make language detection faster because less text needs to be processed.
2. The current LanguageIdentifier.isReasonablyCertain() method uses a fixed value as the threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negatives for short runs of text.
3. Certainty should also be based on how much better the result is for language X compared to the next best language. If two languages have identical sum-of-squares values, and that value is below the threshold, the result is still not very certain.
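
The three points above can be illustrated with a small sketch. This is not Tika's implementation and not necessarily the exact formulation in Dunning's paper: it assumes the goodness-of-fit (G-test) form of the log-likelihood ratio, add-one smoothing, a score normalized by text length, and a relative margin between the best and second-best language. All names here (`trigrams`, `g_statistic`, `rank_languages`, `is_reasonably_certain`, `min_margin`) and the tiny training sentences are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping character 3-grams in lower-cased text."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def g_statistic(observed, profile):
    """Length-normalized G statistic (an LLR test) of text counts vs. a profile.

    Lower means a better fit. Dividing by the number of observed trigrams
    makes scores comparable across texts of different length (issue 2).
    Add-one smoothing is an assumption of this sketch; it keeps trigrams
    that are missing from the profile from producing log(0).
    """
    n = sum(observed.values())
    if n == 0:
        return float("inf")
    vocab = set(observed) | set(profile)
    total = sum(profile.values()) + len(vocab)
    g = 0.0
    for gram, count in observed.items():
        expected = n * (profile.get(gram, 0) + 1) / total
        g += count * math.log(count / expected)
    return 2.0 * g / n


def rank_languages(text, profiles):
    """Return (score, language) pairs sorted from best to worst fit."""
    observed = trigrams(text)
    return sorted((g_statistic(observed, p), lang) for lang, p in profiles.items())


def is_reasonably_certain(ranked, min_margin=0.1):
    """Certain only if the best score beats the runner-up by a relative margin (issue 3)."""
    if len(ranked) < 2:
        return False
    best, second = ranked[0][0], ranked[1][0]
    if not math.isfinite(second) or second <= 0:
        return False
    return (second - best) / second >= min_margin


# Toy profiles built from one sentence each; real profiles need far more text.
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and then the other dog"),
    "de": trigrams("der schnelle braune fuchs springt dann weit weg und der andere hund bleibt faul liegen"),
}
ranked = rank_languages("the dog", profiles)
print(ranked[0][1], is_reasonably_certain(ranked))
```

The margin check means that two languages with near-identical scores are reported as uncertain even when both scores look good in absolute terms. The value of `min_margin` is arbitrary here and would need tuning against real data.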

Attachments

- lingdet-mccs.pdf, 25/Jan/10 17:25, 215 kB, Kenneth William Krugler
- Surprise and Coincidence.pdf, 25/Jan/10 20:40, 1.41 MB, Kenneth William Krugler
- textcat.pdf, 20/Nov/10 21:46, 76 kB, Kenneth William Krugler

Issue Links

is duplicated by

TIKA-1091 Class LanguageIdentifier wrongly detecting the english language sentance

Closed

is related to

NUTCH-666 Analysis plugins for multiple language and new Language Identifier Tool

Closed

is superseded by

TIKA-1723 Integrate language-detector into Tika

Resolved

relates to

TIKA-209 Language detection is weak.

Closed

TIKA-496 Language identifier profile comparison favors large profiles

Closed

TIKA-322 Improve encoding detection speed and accuracy

Resolved

TIKA-1723 Integrate language-detector into Tika

Resolved

TIKA-465 LanguageIdentifier API enhancements

Closed

People

Assignee: Kenneth William Krugler
Reporter: Kenneth William Krugler
Votes: 5
Watchers: 13

Dates

Created: 24/Jan/10 18:52
Updated: 15/Dec/21 16:02
Resolved: 15/Dec/21 16:02