Tika / TIKA-493

Support for macro languages


Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.7
    • Fix Version/s: None
    • Component/s: languageidentifier
    • Labels: None

    Description

      Some languages have variants, and ISO codes exist both for the individual variants and for the macro language that groups them. There should be a way to tell whether the identified language belongs to a macro language, and to return that macro language, because different applications require different codes. For example, in search it makes sense to tag the document with both the specific code and the macro code.

      Example:
      Norwegian: no
      Norwegian bokmål: nb
      Norwegian nynorsk: nn

      The getLanguage() call should continue to return the most correct and specific ISO code (according to which language profile matched).

      In addition, it should be possible to get the macro language.

      Proposed implementation:
      Add some new methods:

      public boolean hasMacroLanguage() // true | false
      public String getMacroLanguage() // In case of "nn" or "nb", result would be "no"

      The definition of macro languages can be added in the property file introduced in TIKA-490.
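
      A minimal sketch of how the proposal could look, under stated assumptions: the class name MacroLanguages and the property key format macro.<variant>=<macro> are hypothetical, since the issue does not specify the layout of the TIKA-490 property file. Unlike the proposed no-argument methods on the language identifier, this standalone helper takes the language code as a parameter so that it can run on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper for illustration; not part of Tika's API. */
final class MacroLanguages {
    // Assumed key format: macro.<variant code>=<macro language code>
    private static final String PREFIX = "macro.";

    // Maps a variant code (e.g. "nn") to its macro language code (e.g. "no").
    private final Map<String, String> macroByVariant = new HashMap<>();

    MacroLanguages(Properties props) {
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                String variant = key.substring(PREFIX.length());
                macroByVariant.put(variant, props.getProperty(key).trim());
            }
        }
    }

    /** Parses property-file text; wraps the checked IOException for brevity. */
    static MacroLanguages parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new MacroLanguages(props);
    }

    /** True if the given language code is a variant of a macro language. */
    boolean hasMacroLanguage(String language) {
        return macroByVariant.containsKey(language);
    }

    /** Returns the macro language code, or null if the language has none. */
    String getMacroLanguage(String language) {
        return macroByVariant.get(language);
    }
}
```

      With the entries macro.nb=no and macro.nn=no, a search application could index a document identified as "nn" under both "nn" and the value of getMacroLanguage("nn"), while a language with no entry would be indexed under its own code only.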

          People

            kkrugler Kenneth William Krugler
            janhoy Jan Høydahl
