[SOLR-4871] Another (fast) language identifier (port of langid.py) - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Trivial
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: contrib - LangId
Labels: None

Description

I've ported langid.py to Java. It is a Python language identifier with some very nice properties (see the research paper by Marco Lui) and pretty good language identification quality.

The major benefit, though, is speed. Without subsampling (which Google Code's languagedetection does), the benchmark on Europarl clocks in at:

--> langid-v3
     20826/     21000 (99.1714%) in 0.75 sec. (28075 docs/sec.)
--> languagedetect
     20846/     21000 (99.2667%) in 4.24 sec. (4948 docs/sec.)

So the language detection quality is nearly the same, at more than five times the speed. If you limit the number of languages to detect, it will be faster still; see the benchmarking snippets.
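The comparison can be checked directly from the figures reported above. The following is only a small arithmetic sketch; the class and method names (BenchmarkMath, accuracyPercent, speedup) are made up for illustration and are not part of langid-java or its benchmark code.

```java
import java.util.Locale;

// Illustrative helper, not part of langid-java: recomputes the ratios
// from the benchmark figures quoted in this issue.
final class BenchmarkMath {
  /** Share of correctly identified documents, in percent. */
  static double accuracyPercent(int correct, int total) {
    return 100.0 * correct / total;
  }

  /** How many times higher the first throughput is than the second. */
  static double speedup(double fastDocsPerSec, double slowDocsPerSec) {
    return fastDocsPerSec / slowDocsPerSec;
  }

  public static void main(String[] args) {
    // Counts and docs/sec figures are copied from the benchmark output above.
    System.out.printf(Locale.ROOT, "langid-v3 accuracy:      %.4f%%%n",
        accuracyPercent(20826, 21000));
    System.out.printf(Locale.ROOT, "languagedetect accuracy: %.4f%%%n",
        accuracyPercent(20846, 21000));
    System.out.printf(Locale.ROOT, "throughput ratio:        %.2fx%n",
        speedup(28075, 4948));
  }
}
```

The two runs differ by 20 documents out of 21000, while the throughput ratio comes out a little above 5.6.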

Yet another nice property is that it runs on UTF-8 byte sequences natively. I've built in a loop that uses Java's default charset decoder, but if you already have a BytesRef you don't need to create strings at all.
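To illustrate the idea of feeding UTF-8 bytes to a byte-level classifier without building an intermediate byte array for the whole text, here is a minimal sketch using only the JDK. It assumes the conversion goes from Java character data to UTF-8 bytes, which on the JDK side means a CharsetEncoder. Utf8Chunks, ByteSink and feed are hypothetical names for this example and are not the langid-java API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: streams a CharSequence as UTF-8 chunks to a byte consumer.
final class Utf8Chunks {
  /** Hypothetical sink for raw UTF-8 bytes; stands in for a byte-level classifier. */
  interface ByteSink {
    void accept(byte[] buffer, int offset, int length);
  }

  /** Encodes text to UTF-8 through one reusable buffer and hands each chunk to the sink. */
  static void feed(CharSequence text, ByteSink sink) {
    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    CharBuffer in = CharBuffer.wrap(text);
    ByteBuffer out = ByteBuffer.allocate(4096); // reused for every chunk

    CoderResult result;
    do {
      // Encode as much as fits; OVERFLOW means the output buffer is full.
      result = encoder.encode(in, out, true);
      drain(out, sink);
    } while (result.isOverflow());

    do {
      // Flush any state the encoder still holds (a no-op for UTF-8 in practice).
      result = encoder.flush(out);
      drain(out, sink);
    } while (result.isOverflow());
  }

  /** Passes the bytes written so far to the sink and resets the buffer. */
  private static void drain(ByteBuffer out, ByteSink sink) {
    out.flip();
    if (out.hasRemaining()) {
      sink.accept(out.array(), out.arrayOffset() + out.position(), out.remaining());
    }
    out.clear();
  }
}
```

If the input is already available as UTF-8 bytes (for example in a BytesRef), this step is unnecessary and the bytes can be passed on as they are.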

Release artifacts are on Sonatype OSS:
https://oss.sonatype.org/content/repositories/releases/com/carrotsearch/langid-java/

The source code is on GitHub:
https://github.com/carrotsearch/langid-java