Tika / TIKA-492

Add language identification support for North Sami, Lule Sami and South Sami

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.7
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels: None

      Description

      We need to add support for the Sami languages.

      According to the document "Requirements for support for Sami languages in data processing" (http://www.samit.no/01-850-51.pdf), Tika will reach "Basic Level" support once it detects North Sami, Lule Sami and South Sami.

        Activity

        Ken Krugler added a comment -

        Hi Jan,

        Do you have profile files for these languages?

        Thanks,

        – Ken

        Jan Høydahl added a comment -

        I'm in the process of gathering enough text content for the profiles.

        I also posted a question to the user list asking which tool or process you use to generate profiles, but have not seen an answer yet.

        Ken Krugler added a comment -

        Sorry, I must have missed that question. I think Jukka handled this previously, though Jerome and Chris did the original work in Nutch and Jukka then simplified things (see TIKA-209).

        I'd repost, and depending on the response I'd file an issue about documenting the creation of language profiles.
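
For readers unfamiliar with what a language profile contains, the general idea can be sketched as counting character n-grams in sample text and comparing the resulting frequency tables. This is a minimal illustration only, not the tool Tika or Nutch uses: the function names, the trigram length, the `_` word padding and the Euclidean distance are all assumptions chosen for the example.

```python
import math
from collections import Counter


def build_profile(text, n=3):
    """Count overlapping character n-grams, padding each word with '_'."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def distance(profile_a, profile_b):
    """Euclidean distance between relative n-gram frequencies (assumed metric)."""
    total_a = sum(profile_a.values()) or 1  # avoid dividing by zero on empty input
    total_b = sum(profile_b.values()) or 1
    grams = set(profile_a) | set(profile_b)
    return math.sqrt(sum(
        (profile_a.get(g, 0) / total_a - profile_b.get(g, 0) / total_b) ** 2
        for g in grams
    ))


def detect(text, profiles):
    """Return the key of the reference profile closest to the text."""
    sample = build_profile(text)
    return min(profiles, key=lambda lang: distance(sample, profiles[lang]))
```

In such a scheme, one profile per language would be built from a large text collection (for example one each for North Sami, Lule Sami and South Sami) and passed to `detect` as a dictionary keyed by language code. Closely related languages typically need much more training text to be told apart reliably than unrelated ones.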

        Pander Musubi added a comment -

        Please see also https://issues.apache.org/jira/browse/TIKA-369 proposing to use https://code.google.com/p/language-detection/ for improved language detection.


          People

          • Assignee: Ken Krugler
          • Reporter: Jan Høydahl
          • Votes: 0
          • Watchers: 1
