[TIKA-1696] Language Identification with Text Processing Toolkit from MITLL - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.13
Component/s: languageidentifier
Labels: None

Description

The aim here is to extend the methods for language identification within text. MIT Lincoln Labs has an open source library [1] written in Julia. Having spoken with the MITLL guys, I understand there may be a Scala version of this library, which would make it easier to package with Tika.

At this point I'm not quite sure how many languages this library supports by default, but it can be extended given some training data.

[1] https://github.com/mit-nlp/Text.jl

Issue Links

is related to: TIKA-1723 Integrate language-detector into Tika (Resolved)

is required by: TIKA-1872 Backport tika-langdetect from 2.x branch to 1.13 branch (Resolved)

People

Assignee: Chris A. Mattmann

Reporter: Paul Ramirez

Votes: 0

Watchers: 6

Dates

Created: 23/Jul/15 17:38

Updated: 22/Apr/16 22:24

Resolved: 22/Apr/16 22:24