Stanbol / STANBOL-102

Make the NER enhancement engine able to use different models for different languages

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0-incubating
    • Component/s: Enhancer
    • Labels: None

      Description

      Currently, the list of models is hardcoded: the engine always uses en-{person,location,organization}-ner.bin. The engine should be adapted to look up other models (following the {language-code}-{entity-class}-ner.bin filename pattern) according to the language of the text. If no such model is found, the engine should refuse to compute enhancements instead of using the wrong model, which would output spurious annotations.
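
      The lookup requested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class NerModelLookup, its methods, and the idea of passing the available file names as a set are hypothetical and are not taken from the Stanbol code base. Only the {language-code}-{entity-class}-ner.bin pattern and the three entity classes come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper (not part of Stanbol): resolves NER model file names by language. */
final class NerModelLookup {

    /** Entity classes named in the issue description. */
    static final List<String> ENTITY_CLASSES = List.of("person", "location", "organization");

    /** Builds a file name following the {language-code}-{entity-class}-ner.bin pattern. */
    static String modelFileName(String languageCode, String entityClass) {
        return languageCode + "-" + entityClass + "-ner.bin";
    }

    /**
     * Returns the model files available for the given language, or an empty
     * Optional when none exists, so the caller can skip the text instead of
     * falling back to the models of another language.
     */
    static Optional<List<String>> modelsFor(String languageCode, Set<String> availableFiles) {
        List<String> found = new ArrayList<>();
        for (String entityClass : ENTITY_CLASSES) {
            String name = modelFileName(languageCode, entityClass);
            if (availableFiles.contains(name)) {
                found.add(name);
            }
        }
        return found.isEmpty() ? Optional.empty() : Optional.of(found);
    }
}
```

      With such a helper, a caller would treat an empty result as "do not enhance this text" rather than defaulting to the English models.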

        Activity

        Rupert Westenthaler added a comment -

        I will implement this similarly to the KeywordLinkingEngine.

        Two Options:

        • Default Language: If configured, this language is used when no language was detected for a text (e.g. if no language detection engine is active).
        • Processed Languages: Configures the list of languages processed by an engine instance. If empty or not present, all languages are processed. This makes it possible to create multiple instances of the NER engine (with different configurations) that each process only specific languages.
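
        How the two options might interact can be sketched as below. This is a minimal sketch, not the actual engine code: the class LanguageConfig and the method effectiveLanguage are hypothetical names, and applying the processed-languages filter to the default language as well is an assumption that the comment does not confirm.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of the two options; names are illustrative, not Stanbol's. */
final class LanguageConfig {
    private final String defaultLanguage;         // null means no default is configured
    private final Set<String> processedLanguages; // empty means any language is processed

    LanguageConfig(String defaultLanguage, Set<String> processedLanguages) {
        this.defaultLanguage = defaultLanguage;
        this.processedLanguages = Set.copyOf(processedLanguages);
    }

    /** Returns the language to process the text with, or empty if the text should be skipped. */
    Optional<String> effectiveLanguage(String detectedLanguage) {
        // Fall back to the default only when no language was detected.
        String lang = detectedLanguage != null ? detectedLanguage : defaultLanguage;
        if (lang == null) {
            return Optional.empty(); // nothing detected and no default configured
        }
        // Assumption: the filter applies to the default language too.
        if (!processedLanguages.isEmpty() && !processedLanguages.contains(lang)) {
            return Optional.empty();
        }
        return Optional.of(lang);
    }
}
```

        For example, an instance created with default language "en" and no processed languages would process undetected texts as English, while an instance restricted to "de" would skip a text detected as "fr".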

        In addition, I will change this engine to use the ConfigurationFactory. This will allow multiple instances to be configured, and a default configuration with the default values for default language (none) and processed languages (any) will be included in the Stanbol launchers.

        The base framework for dynamically loading OpenNLP NER models for different languages has already been implemented in the meantime by the OpenNLP utility (part of the org.apache.stanbol.commons.opennlp module).

        Olivier Grisel added a comment -

        Looks great. +1.

        Rupert Westenthaler added a comment -

        These changes will also require the activation of the LangId engine in the Stable Launcher.

        Rupert Westenthaler added a comment -

        Implemented with revision #1228163


          People

          • Assignee: Rupert Westenthaler
          • Reporter: Olivier Grisel
          • Votes: 0
          • Watchers: 0
