[NUTCH-2449] Usage of Tika LanguageIdentifier in language-identifier plugin - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.13
Fix Version/s: 1.19
Component/s: plugin
Labels: None

Description

The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier to detect the language of the document text. There are two issues with that:

1. LanguageIdentifier is deprecated in Tika.
2. It does not support CJK languages (and, I suspect, many other languages; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes), and it does not even fail gracefully on them: in my experience, Chinese text was recognized as Italian.
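
The second issue, an identifier confidently mislabelling Chinese text as Italian, can be illustrated with a script check that runs before any detector result is trusted. This is only a minimal sketch and not the fix that was applied for this issue: `ScriptGuard`, `cjkRatio`, `looksCjk`, and the 0.5 threshold are hypothetical names and values chosen for illustration, and the code relies only on `java.lang.Character.UnicodeScript` from the JDK.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical helper (not part of Nutch or Tika): measures how much of a text is CJK. */
final class ScriptGuard {

    /** Returns the fraction of letters in the text that belong to a CJK script (0.0 if no letters). */
    public static double cjkRatio(String text) {
        int letters = 0;
        int cjk = 0;
        // Iterate by code point so characters outside the BMP are counted once.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip whitespace, digits, punctuation
            }
            letters++;
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN
                    || script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA
                    || script == UnicodeScript.HANGUL) {
                cjk++;
            }
        }
        return letters == 0 ? 0.0 : (double) cjk / letters;
    }

    /** True if at least the given fraction of letters is CJK. The threshold is an assumption. */
    public static boolean looksCjk(String text, double threshold) {
        return cjkRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String chinese = "\u8fd9\u662f\u4e2d\u6587"; // a short Chinese phrase, written as escapes
        String italian = "Questa e una frase";
        System.out.println(looksCjk(chinese, 0.5));
        System.out.println(looksCjk(italian, 0.5));
    }
}
```

A plugin could use such a check to decline to report a language (rather than report a wrong one) when the identifier in use has no models for the dominant script of the text.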

Issue Links

is part of: NUTCH-2891 Upgrade to Tika 2.1 (Closed)
is related to: NUTCH-2278 Handle alpha-2 language codes consistently (Open)
links to: Discussion on user@nutch; GitHub Pull Request #233; GitHub Pull Request #716

People

Assignee: Unassigned
Reporter: Yossi Tamari
Votes: 0
Watchers: 5

Dates

Created: 24/Oct/17 14:29
Updated: 13/Mar/24 14:51
Resolved: 18/Dec/21 04:11