[TIKA-1723] Integrate language-detector into Tika - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.11
Fix Version/s: 1.13
Component/s: languageidentifier
Labels: None
Flags: Patch

Description

The language-detector project at https://github.com/optimaize/language-detector is faster, supports more languages (70 vs. 13), and is more accurate than Tika's built-in language detector.

This is a first attempt at integrating it, with some initial findings. The integration raises a number of issues, especially if chrismattmann moves forward with turning language detection into a pluggable extension point.

I'll add comments with results below.

Attachments

TIKA-1723.patch (74 kB), 27/Aug/15 23:43, Kenneth William Krugler
TIKA-1723v2.patch (770 kB), 28/Aug/15 15:43, Tim Allison
TIKA-1723-2.patch (109 kB), 01/Sep/15 21:18, Kenneth William Krugler
TIKA-1723-3.patch (110 kB), 01/Sep/15 21:48, Kenneth William Krugler

Issue Links

is related to

TIKA-856 Support CJK (Chinese, Japanese and Korean) language detection (Open)
TIKA-493 Support for macro languages (Open)
TIKA-491 Add language identification support for Norwegian Bokmål and Norwegian Nynorsk (Resolved)
TIKA-492 Add language identification support for North Sami, Lule Sami and South Sami (Closed)
TIKA-568 Language Detection isReasonablyCertain() hides valuable information (Open)
TIKA-369 Improve accuracy of language detection (Resolved)

is required by

TIKA-1872 Backport tika-langdetect from 2.x branch to 1.13 branch (Resolved)

relates to

TIKA-1696 Language Identification with Text Processing Toolkit from MITLL (Resolved)
NUTCH-1397 language-identifier incorrectly handles double-barreled language properties (Open)

supersedes

TIKA-496 Language identifier profile comparison favors large profiles (Closed)
TIKA-856 Support CJK (Chinese, Japanese and Korean) language detection (Open)
TIKA-568 Language Detection isReasonablyCertain() hides valuable information (Open)
TIKA-369 Improve accuracy of language detection (Resolved)

People

Assignee: Kenneth William Krugler
Reporter: Kenneth William Krugler
Votes: 1
Watchers: 6

Dates

Created: 27/Aug/15 23:42
Updated: 22/Apr/16 22:24
Resolved: 22/Apr/16 22:24