Tika / TIKA-491

Add language identification support for Norwegian Bokmål and Norwegian Nynorsk

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.7
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels: None

      Description

      Currently there is one Norwegian language profile in Tika: "no". We need to distinguish between the two official Norwegian languages, which have the ISO 639-1 codes "nb" (Bokmål) and "nn" (Nynorsk). Using these codes is recommended over the common "no" tag.

      The proposed solution is to remove the current language profile no.ngp and replace it with two new profiles, one for nb and one for nn.

      We must also add tests for Norwegian.
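
      To illustrate the idea of one profile per language code, here is a minimal sketch in Python. It is not Tika's implementation and does not read or write the .ngp format: the trigram scoring, the function names (trigram_profile, distance, identify) and the two one-sentence training samples are all assumptions made for this example. Real profiles would need far more training text.

```python
from collections import Counter


def trigram_profile(text):
    """Return the relative frequency of each character trigram in text."""
    # Lowercase, collapse whitespace, and mark word boundaries with "_".
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def distance(a, b):
    """Sum of absolute frequency differences over trigrams in either profile."""
    return sum(abs(a.get(g, 0.0) - b.get(g, 0.0)) for g in set(a) | set(b))


# Hypothetical training samples, far too small for real use.
TRAINING = {
    "nb": "jeg vet ikke hva hun heter, men jeg kommer fra Norge",
    "nn": "eg veit ikkje kva ho heiter, men eg kjem frå Noreg",
}

# One profile per ISO 639-1 code, replacing a single combined "no" profile.
PROFILES = {code: trigram_profile(text) for code, text in TRAINING.items()}


def identify(text):
    """Return the code of the profile closest to the text's trigram profile."""
    probe = trigram_profile(text)
    return min(PROFILES, key=lambda code: distance(probe, PROFILES[code]))
```

      A test for Norwegian could then check that text in each language is assigned its own code, for example that identify(TRAINING["nb"]) returns "nb" and identify(TRAINING["nn"]) returns "nn".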

          Activity

          Ken Krugler added a comment -

          Currently the language-detector library I'm integrating (see TIKA-1723) only has support for the 'no' metalanguage, not the two specific dialects as described above. I'd recommend opening an issue at that project (see https://github.com/optimaize/language-detector/). So closing this issue, unless somebody wants to (a) port the current built-in Tika detector to the new architecture, and (b) follow up with Jan about getting training text, and (c) add the new profiles. I'll wait a few days.

          Pander Musubi added a comment -

          Please see also https://issues.apache.org/jira/browse/TIKA-369 proposing to use https://code.google.com/p/language-detection/ for improved language detection.


            People

            • Assignee: Ken Krugler
            • Reporter: Jan Høydahl
            • Votes: 1
            • Watchers: 3
