Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Won't Fix
Description
As originally reported in NUTCH-86, Jerome Charron identified a set of improvements for the LanguageIdentifier that we should consider in Tika.
More information can be found in the following thread on the Nutch-Dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html
Summary:
1. LanguageIdentifier API changes. The similarity methods should return an ordered array of language-code/score pairs instead of a simple String containing the language-code.
2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity().
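To make the two points above concrete, here is a minimal Java sketch of what a ranked result could look like. All names in it (`RankedLanguageSketch`, `LanguageScore`, `Similarity`, `rank`) are hypothetical and are not taken from Nutch or Tika. It assumes that a higher score means a closer match; if the underlying measure is a distance, the comparator would need to be flipped. Ranking goes through a single similarity function, so the scores in the returned array are by construction the same values that the profile-level comparison produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch only; none of these names come from Nutch or Tika. */
public class RankedLanguageSketch {

    /** A language code paired with its similarity score. */
    public static final class LanguageScore implements Comparable<LanguageScore> {
        public final String language;
        public final double score;

        public LanguageScore(String language, double score) {
            this.language = language;
            this.score = score;
        }

        @Override
        public int compareTo(LanguageScore other) {
            // Assumption: higher score = closer match, so sort descending.
            return Double.compare(other.score, this.score);
        }

        @Override
        public String toString() {
            return language + "=" + score;
        }
    }

    /** Stand-in for whatever profile-level similarity measure is used. */
    public interface Similarity {
        double score(String languageCode, String text);
    }

    /**
     * Returns every candidate language with its score, best match first,
     * instead of a single language-code String.
     */
    public static LanguageScore[] rank(String text, List<String> languages, Similarity similarity) {
        List<LanguageScore> results = new ArrayList<>();
        for (String language : languages) {
            // The same similarity function feeds both the score and the ordering.
            results.add(new LanguageScore(language, similarity.score(language, text)));
        }
        Collections.sort(results);
        return results.toArray(new LanguageScore[0]);
    }

    public static void main(String[] args) {
        // Toy similarity with made-up scores, purely for illustration.
        Similarity toy = (language, text) -> language.equals("fr") ? 0.9 : 0.2;
        LanguageScore[] ranked = rank("bonjour le monde", Arrays.asList("en", "fr", "de"), toy);
        System.out.println(Arrays.toString(ranked));
    }
}
```

A caller that only wants the old behavior can still take the first element's `language`, while callers that care about confidence can inspect the gap between the top scores.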
I wanted to capture the issue here in Tika because I'm about to close it out in Nutch: language identification is something that can happen in Tika-ville.
Issue Links
- is related to TIKA-369 Improve accuracy of language detection (Resolved)