Details
- Improvement
- Status: Resolved
- Minor
- Resolution: Fixed
- 1.11
- None
- Patch
Description
The language-detector project at https://github.com/optimaize/language-detector is faster, supports more languages (70 vs. 13), and is more accurate than the built-in language detector.
This is a first stab at integrating it, with some initial findings. The integration raises a number of issues, especially if chrismattmann moves forward with turning language detection into a pluggable extension point.
I'll add comments with results below.
Attachments
Issue Links
- is related to
  - TIKA-856 Support CJK (Chinese, Japanese and Korean) language detection (Open)
  - TIKA-493 Support for macro languages (Open)
  - TIKA-491 Add language identification support for Norwegian Bokmål and Norwegian Nynorsk (Resolved)
  - TIKA-492 Add language identification support for North Sami, Lule Sami and South Sami (Closed)
  - TIKA-568 Language Detection isReasonablyCertain() hides valuable information (Open)
  - TIKA-369 Improve accuracy of language detection (Resolved)
- is required by
  - TIKA-1872 Backport tika-langdetect from 2.x branch to 1.13 branch (Resolved)
- relates to
  - TIKA-1696 Language Identification with Text Processing Toolkit from MITLL (Resolved)
  - NUTCH-1397 language-identifier incorrectly handles double-barreled language properties (Open)
- supercedes
  - TIKA-496 Language identifier profile comparison favors large profiles (Closed)
  - TIKA-856 Support CJK (Chinese, Japanese and Korean) language detection (Open)
  - TIKA-568 Language Detection isReasonablyCertain() hides valuable information (Open)
  - TIKA-369 Improve accuracy of language detection (Resolved)