Solr / SOLR-4871

Another (fast) language identifier (port of langid.py)


Details

    • Type: New Feature
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: contrib - LangId
    • Labels: None

    Description

      I've ported langid.py to Java. It is a Python language identifier with some very nice properties (see the research paper by Marco Lui) and pretty good language identification quality.

      The major benefit, though, is speed. Without subsampling (which Google Code's languagedetection does), the benchmark on Europarl clocks in at:

      --> langid-v3
           20826/     21000 (99.1714%) in 0.75 sec. (28075 docs/sec.)
      --> languagedetect
           20846/     21000 (99.2667%) in 4.24 sec. (4948 docs/sec.)
      

      So the language detection quality is nearly the same (99.17% vs. 99.27% correct) and the throughput is more than five times higher (28075 vs. 4948 docs/sec.). If you limit the number of languages to detect, it'll be faster still; see the benchmarking snippets.
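
      The comparison can be re-derived from the benchmark figures quoted above. This is only a small arithmetic sketch: the numbers are copied from the output, while the dictionary layout and the helper names (accuracy, speedup) are made up for illustration.

```python
# Figures copied from the benchmark output above (docs/sec. as reported there).
results = {
    "langid-v3": {"correct": 20826, "total": 21000, "docs_per_sec": 28075},
    "languagedetect": {"correct": 20846, "total": 21000, "docs_per_sec": 4948},
}


def accuracy(run):
    """Fraction of documents whose language was identified correctly."""
    return run["correct"] / run["total"]


def speedup(fast, slow):
    """Throughput ratio between two runs, based on reported docs/sec."""
    return fast["docs_per_sec"] / slow["docs_per_sec"]


fast, slow = results["langid-v3"], results["languagedetect"]
accuracy_gap = accuracy(slow) - accuracy(fast)
ratio = speedup(fast, slow)

print(f"accuracy gap: {accuracy_gap * 100:.4f} percentage points")
print(f"throughput ratio: {ratio:.2f}x")
```

      The accuracy gap amounts to 20 documents out of 21000, while the throughput ratio comes out between 5.6 and 5.7.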

      Yet another nice property is that it runs on UTF-8 byte sequences natively. I've built in a loop that uses Java's default charset decoder, but if you already have a BytesRef you don't need to create strings at all.
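
      To illustrate why working on raw UTF-8 matters: langid.py's model is built on byte n-gram features, so the input never has to be decoded into characters. The sketch below is not the langid-java API and not langid.py's actual feature set; the function name byte_ngrams and the choice of n = 2 are assumptions made purely for illustration.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping n-grams of raw bytes; nothing is decoded to str."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Any already-encoded buffer would do here; encoding a literal just
# produces some UTF-8 bytes to work with.
raw = "zażółć".encode("utf-8")
counts = byte_ngrams(raw, 2)

for gram, count in counts.most_common(3):
    print(gram, count)
```

      Multi-byte characters simply contribute several bytes to the stream, so a consumer that already holds encoded bytes can skip string creation altogether.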

      The released artifacts are at Sonatype:
      https://oss.sonatype.org/content/repositories/releases/com/carrotsearch/langid-java/

      The source code is on GitHub:
      https://github.com/carrotsearch/langid-java

People

    • Assignee: Unassigned
    • Reporter: Dawid Weiss (dweiss)
