[TIKA-465] LanguageIdentifier API enhancements - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: languageidentifier
Labels: None

Description

In NUTCH-86, Jerome Charron identified a set of improvements for the LanguageIdentifier that we should consider in Tika:

More information can be found in the following thread on the Nutch-Dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html

Summary:

1. LanguageIdentifier API changes. The similarity methods should return an ordered array of language-code/score pairs instead of a simple String containing the language-code.

2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity().
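
To make item 1 concrete, here is a minimal sketch of what an ordered list of language-code/score pairs could look like. It is not the Tika or Nutch API: the names `RankedLanguages`, `LanguageScore`, and `rank` are hypothetical, and it assumes a higher score means a closer match (if the underlying measure is a distance, the sort order would need to be flipped).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not part of the Tika or Nutch API. */
public class RankedLanguages {

    /** One language-code/score pair (hypothetical type). */
    public static final class LanguageScore {
        private final String language;
        private final double score;

        public LanguageScore(String language, double score) {
            this.language = language;
            this.score = score;
        }

        public String getLanguage() {
            return language;
        }

        public double getScore() {
            return score;
        }

        @Override
        public String toString() {
            return language + "=" + score;
        }
    }

    /**
     * Turns per-language similarity scores into a list ordered from the
     * highest score to the lowest. Assumption: higher score = better match.
     */
    public static List<LanguageScore> rank(Map<String, Double> similarities) {
        List<LanguageScore> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> entry : similarities.entrySet()) {
            ranked.add(new LanguageScore(entry.getKey(), entry.getValue()));
        }
        // Sort descending by score so the best candidate comes first.
        ranked.sort(Comparator.comparingDouble(LanguageScore::getScore).reversed());
        return ranked;
    }
}
```

A caller that only wants the single best guess can still take the first element, while callers that care about ambiguity can inspect the gap between the first and second scores.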

I just wanted to capture the issue here in Tika, because I'm about to close it out in Nutch: language identification is something that can happen in Tika-ville...

Issue Links

is related to: TIKA-369 Improve accuracy of language detection (Resolved)

People

Assignee: Kenneth William Krugler

Reporter: Chris A. Mattmann

Votes: 0

Watchers: 2

Dates

Created: 14/Jul/10 17:39

Updated: 01/Mar/15 23:02

Resolved: 01/Mar/15 23:00