Description
The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier to extract the language from the document text. This has two problems:
- LanguageIdentifier is deprecated in Tika.
- It does not support CJK languages (and, I suspect, many others; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes), and it does not fail gracefully on them: in my experience, Chinese text was recognized as Italian.
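To illustrate what "failing gracefully" could look like, here is a minimal sketch of a script check that could run before trusting a Latin-language guess. It uses only the JDK. The class `CjkScriptGuard`, the method `isMostlyCjk`, and the 50% threshold are assumptions made up for this example; they are not part of Nutch or Tika, and this is not a proposed patch.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
final class CjkScriptGuard {

    private CjkScriptGuard() {}

    /**
     * Returns true if more than half of the letters in the text belong to
     * the Han, Hiragana, Katakana, or Hangul scripts. The 50% threshold is
     * an arbitrary assumption.
     */
    static boolean isMostlyCjk(String text) {
        int letters = 0;
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, whitespace, punctuation
            }
            letters++;
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN
                    || script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA
                    || script == UnicodeScript.HANGUL) {
                cjk++;
            }
        }
        return letters > 0 && cjk * 2 > letters;
    }

    public static void main(String[] args) {
        System.out.println(isMostlyCjk("这是一个中文句子"));            // true
        System.out.println(isMostlyCjk("Questa è una frase italiana")); // false
    }
}
```

With a check like this, a caller could leave the language unset for CJK text instead of storing a wrong Latin-language code. It only detects the script; it does not distinguish Chinese from Japanese or Korean.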
Issue Links
- is part of: NUTCH-2891 Upgrade to Tika 2.1 (Closed)
- is related to: NUTCH-2278 Handle alpha-2 language codes consistently (Open)
- links to