  1. Tika
  2. TIKA-1696

Language Identification with Text Processing Toolkit from MITLL

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.13
    • Component/s: languageidentifier
    • Labels:
      None

      Description

      The aim here is to extend Tika's methods for identifying the language of text. MIT Lincoln Laboratory (MITLL) has an open source library [1] written in Julia. Having spoken with the MITLL team, I understand there may also be a Scala version of this library, which would make it easier to package with Tika.

      At this point I'm not sure how many languages the library supports by default, but it can be extended given additional training data.

      [1] https://github.com/mit-nlp/Text.jl
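
      To illustrate what "extended given training data" could look like, here is a minimal sketch of a trainable language identifier based on character trigrams with add-one smoothing. It is an assumption made for illustration only: it is not the API or algorithm of Text.jl [1], nor of Tika's existing language identifier, and the names NgramLanguageIdentifier, train and identify are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Hypothetical trainable identifier; not the Text.jl or Tika implementation."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram frequencies
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any language

    def train(self, language, text):
        """Add one labelled sample; call repeatedly to extend a language or add a new one."""
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def identify(self, text):
        """Return the trained language with the highest smoothed log-likelihood, or None."""
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen n-grams
        best, best_score = None, -math.inf
        for language, counter in self.counts.items():
            denominator = self.totals[language] + vocab_size
            score = sum(math.log((counter[g] + 1) / denominator) for g in grams)
            if score > best_score:
                best, best_score = language, score
        return best


if __name__ == "__main__":
    identifier = NgramLanguageIdentifier()
    identifier.train("en", "the quick brown fox jumps over the lazy dog and the cat")
    identifier.train("es", "el rápido zorro marrón salta sobre el perro perezoso y el gato")
    print(identifier.identify("the dog and the fox"))
```

      Supporting another language in this sketch only requires calling train with labelled text for it. A real integration would need far larger training corpora than the toy sentences shown here.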

              People

              • Assignee:
                chrismattmann Chris A. Mattmann
                Reporter:
                pramirez Paul Ramirez
              • Votes:
                0
                Watchers:
                6
