Tika / TIKA-492

Add language identification support for North Sami, Lule Sami and South Sami

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.7
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels:
      None

      Description

      We need to add support for the Sami languages.

      According to the document "Requirements for support for Sami languages in data processing" (http://www.samit.no/01-850-51.pdf), Tika will reach "Basic Level" support by detecting North Sami, Lule Sami and South Sami.

          Activity

          Ken Krugler added a comment -

          Hi Jan,

          Do you have profile files for these languages?

          Thanks,

          – Ken

          Jan Høydahl added a comment -

          I'm in the process of gathering enough text content for the profiles.

          I also posted a question to the user list asking what tool or process you use to generate profiles, but I have not seen an answer yet.

          Ken Krugler added a comment -

          Sorry, I must have missed that question. I think Jukka handled this previously, though Jerome and Chris did the original work in Nutch, and then Jukka simplified things; see TIKA-209.

          I'd repost, and depending on the response I'd file an issue about documenting the creation of language profiles.

          Pander Musubi added a comment -

          Please see also https://issues.apache.org/jira/browse/TIKA-369 proposing to use https://code.google.com/p/language-detection/ for improved language detection.

          Ken Krugler added a comment -

          Currently the language-detector library I'm integrating (see TIKA-1723) doesn't support any of the three Sami languages. I'd open an issue at that project (see https://github.com/optimaize/language-detector/). So I'm closing this issue, unless somebody wants to (a) port the current built-in Tika detector to the new architecture, (b) follow up with Jan about getting training text, and (c) add the new profiles. I'll wait a few days.

          Jan Høydahl added a comment -

          Closing. I will consider contributing profiles to the other library if I need this again in some project.


            People

            • Assignee: Ken Krugler
            • Reporter: Jan Høydahl
            • Votes: 0
            • Watchers: 3
