Details

      Description

      EnhancementEngine based on the AnalyzedText ContentPart. It takes the text/plain content of a ContentItem and stores an AnalyzedText content part in the ContentItem where each token is assigned its grammatical POS tag.

      This EnhancementEngine can consume

      • preexisting Tokens (if someone wants to use a different tokenizer than the OpenNLP one)
      • preexisting Sentences (if someone wants to use a different sentence detector than the OpenNLP one)

      If no tokens/sentences are present, the Engine will use the OpenNLP components.
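      The fallback behavior can be sketched as follows. This is a minimal illustration of the described logic, not the actual Stanbol engine code: the class and method names (`TokenFallback`, `tokensOrFallback`) are made up, and simple whitespace splitting stands in for the OpenNLP tokenizer.

```java
// Hypothetical sketch of the fallback logic described above: consume
// preexisting tokens when present, otherwise tokenize the text ourselves.
// (Whitespace splitting stands in for the OpenNLP tokenizer here.)
import java.util.ArrayList;
import java.util.List;

public class TokenFallback {

    /** Returns the preexisting tokens, or tokenizes the text as a fallback. */
    static List<String> tokensOrFallback(List<String> existing, String text) {
        if (existing != null && !existing.isEmpty()) {
            return existing; // consume preexisting Tokens
        }
        // fallback tokenization (placeholder for the OpenNLP components)
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```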

        Activity

        rwesten Rupert Westenthaler added a comment -

        with http://svn.apache.org/viewvc?rev=1406170&view=rev the engine supports ServiceProperties

        rwesten Rupert Westenthaler added a comment - - edited

        Documentation for this engine

        OpenNLP POS Tagging Engine
        ==========

        POS tagging Engine using the [AnalyzedText](../nlp/analyzedtext) ContentPart based on the [OpenNLP](http://opennlp.apache.org) POS tagging functionality.

          1. Consumed information
        • _Language_ (required): The language of the text needs to be available. It is read as specified by [STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613) from the metadata of the ContentItem. Effectively this means that a Stanbol Language Detection engine needs to be executed before the OpenNLP POS Tagging Engine.
        • _Sentence_s (optional): In case _Sentence_s are available in the _AnalyzedText_ content part the tokenization of the text is done sentence by sentence. If no _Sentence_s are available this engine detects sentences if a sentence detection model is available for that language (see below for more information). If no _Sentence_s are present and no OpenNLP sentence detection model is available for the language of the processed text, then the whole text is processed as a single sentence.
        • _Token_s (optional): For POS tagging the text needs to be tokenized. This Engine tries to consume _Token_s from the _AnalyzedText_ content part. If no _Token_s are available it uses the OpenNLP tokenizer to tokenize the text (see below for more information).
          1. POS Tagging

        POS tags are represented by adding NlpAnnotations#POS_ANNOTATIONs to the _Token_s of the _AnalyzedText_ content part. As the OpenNLP POS tagger supports multiple POS tag/probability suggestions, the OpenNLP POS Tagging Engine can add multiple POS annotations to a single Token.

        POS annotations are added by using the key "stanbol.enhancer.nlp.pos" and are represented by the PosTag class. However, typical users will rather use the NlpAnnotations#POS_ANNOTATION to access the POS annotations of tokens:

        :::java
        //The POS tag with the highest probability
        Value<PosTag> posAnnotation = token.getAnnotation(NlpAnnotations.POS_ANNOTATION);
        //Get the list of all POS annotations
        List<Value<PosTag>> posAnnotations = token.getAnnotations(NlpAnnotations.POS_ANNOTATION);

        //Value provides the probability and the PosTag
        double prob = posAnnotation.probability();
        PosTag pos = posAnnotation.value();
        //The string tag as used by the POS tagger
        String tag = pos.getTag();

        //POS tags can be mapped to LexicalCategories and Pos types
        //so we can check if a Token is a Noun without the need to
        //know the POS tags used by the POS tagger of the current language
        boolean isNoun = pos.hasCategory(LexicalCategory.Noun);
        boolean isProperNoun = pos.hasPos(Pos.ProperNoun);

        //but not all PosTags might be mapped so we should check for
        boolean mapped = pos.isMapped();

        The OpenNLP POS Tagging engine supports mapped PosTags for the following languages:

        • English: based on the Penn Treebank mappings to the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/) ([annotation model](http://purl.org/olia/penn.owl), [linking model](http://purl.org/olia/penn-link.rdf))
        • German: based on the STTS mapping to the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/) ([annotation model](http://purl.org/olia/stts.owl), [linking model](http://purl.org/olia/stts-link.rdf))
        • Spanish: based on the PAROLE TagSet mapping to the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/) ([annotation model](http://purl.org/olia/parole_es_cat.owl))
        • Danish: mappings for the PAROLE Tagset as described by [this paper](http://korpus.dsl.dk/paroledoc_en.pdf)
        • Portuguese: mappings based on the [PALAVRAS tag set](http://beta.visl.sdu.dk/visl/pt/symbolset-floresta.html)
        • Dutch: mappings based on the WOTAN Tagset for Dutch as described by "WOTAN: Een automatische grammatikale tagger voor het Nederlands", doctoral dissertation, Department of Language & Speech, Nijmegen University (renamed to Radboud University), December 1994. NOTE that this TagSet does NOT distinguish between _ProperNoun_s and _CommonNoun_s.
        • Swedish: based on the [Lexical categories in MAMBA](http://w3.msi.vxu.se/users/nivre/research/MAMBAlex.html)

        _TODO:_ Currently the Engine is limited to these TagSets as it is not yet possible to extend them with additional ones.

          1. Tokenizing and Sentence Detection Support

        The OpenNLP POS Tagging engine implicitly supports tokenizing and sentence detection. That means if the [AnalyzedText](../nlp/analysedtext) is not present or does not contain _Token_s, then this engine will use the OpenNLP Tokenizer to tokenize the text. If no language-specific OpenNLP tokenizer model is available, it will use the SIMPLE_TOKENIZER.

        Sentence detection is only done if no _Sentence_s are present in the _AnalyzedText_ AND a language-specific sentence detection model is available.

        _NOTE:_ Support for tokenizing and sentence detection is not a replacement for explicitly adding a Tokenizing and a Sentence Detection Engine to an Enhancement Chain, as this Engine does not guarantee that _Token_s or _Sentence_s are added to the _AnalyzedText_ content part. If no POS model is available for a language, or a language is not configured to be processed, neither _Token_s nor _Sentence_s will be added. Chains that rely on _Token_s and/or _Sentence_s MUST explicitly include a Tokenizing and a Sentence Detection engine!

          1. Configuration

        NOTE that the OpenNLP POS Tagging engine provides a default service instance (configuration policy is optional). This instance processes all languages where default POS models are provided by the OpenNLP service. This Engine instance uses the name 'opennlp-pos' and has a service ranking of '-100'.

        While this engine supports the default configuration including the _name_ (stanbol.enhancer.engine.name) and the _ranking_ (service.ranking), the engine also allows configuring the _processed languages_ (org.apache.stanbol.enhancer.pos.languages) and a parameter to specify the name of the POS model used for a language.

        _1. Processed Language Configuration:_

        For the configuration of the processed languages the following syntax is used:

        de
        en

        This would configure the Engine to only process German and English texts. It is also possible to explicitly exclude languages:

        !fr
        !it
        *

        This specifies that all languages other than French and Italian are processed.

        Values can be parsed as Array or Vector. This is done by using the ["elem1","elem2",...] syntax as defined by OSGI ".config" files. As fallback also ',' separated Strings are supported.

        The following example shows the two above examples combined to a single configuration.

        org.apache.stanbol.enhancer.pos.languages=["!fr","!it","de","en","*"]

        NOTE that the "processed language" configuration only specifies which languages are considered for processing. If "de" is enabled, but there is no POS model available for that language, then German text will still not be processed. Conversely, even if there is a POS model for "it" but the "processed language" configuration does not include Italian, Italian text will NOT be processed.
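        The include/exclude/wildcard semantics described above can be sketched as follows. This is an illustrative reimplementation of the described rules, not the actual Stanbol configuration code; the class and method names are made up.

```java
// Hypothetical sketch of the processed-language semantics described above:
// "!lang" entries explicitly exclude a language, plain entries include one,
// and "*" enables all languages not explicitly excluded.
import java.util.List;

public class ProcessedLanguages {

    static boolean isProcessed(List<String> config, String lang) {
        boolean wildcard = false;
        boolean included = false;
        for (String entry : config) {
            if (entry.equals("*")) {
                wildcard = true;                     // all remaining languages
            } else if (entry.startsWith("!")) {
                if (entry.substring(1).equals(lang)) {
                    return false;                    // explicitly excluded
                }
            } else if (entry.equals(lang)) {
                included = true;                     // explicitly included
            }
        }
        return included || wildcard;
    }
}
```

        With the combined example configuration `["!fr","!it","de","en","*"]`, German and Spanish would be processed while French would not.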

        _2. POS model parameter_

        The OpenNLP POS annotation engine supports the 'model' parameter to explicitly parse the name of the POS model used for a language. POS models are loaded via the Stanbol DataFile provider infrastructure. That means that models can be loaded from the {stanbol-working-dir}/stanbol/datafiles folder.

        The syntax for parameters is as follows:

        {language};{param-name}={param-value}

        So to use "my-de-pos-model.zip" for POS tagging German texts, one can use a configuration like the following:

        de;model=my-de-pos-model.zip
        *
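        Parsing the `{language};{param-name}={param-value}` entries can be sketched like this. The sketch only mirrors the syntax shown above; it is not the Stanbol parser, and the class name is hypothetical.

```java
// Hypothetical parser for the '{language};{param-name}={param-value}'
// configuration syntax shown above (a sketch, not the Stanbol code).
import java.util.HashMap;
import java.util.Map;

public class LangConfigEntry {

    final String language;
    final Map<String, String> params = new HashMap<>();

    LangConfigEntry(String entry) {
        String[] parts = entry.split(";");
        this.language = parts[0].trim();       // e.g. "de", "*" or "!fr"
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2) {
                params.put(kv[0].trim(), kv[1].trim());
            }
        }
    }
}
```

        For the example above, `new LangConfigEntry("de;model=my-de-pos-model.zip")` yields the language "de" with a single 'model' parameter.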

        By default OpenNLP POS models are loaded for the names '{lang}-pos-perceptron.bin' and '{lang}-pos-maxent.bin'. To use models with other names, users need to use the 'model' parameter as described above.
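        The model-name resolution described above can be sketched as follows. This is an assumption-laden illustration: the real engine loads models via the Stanbol DataFile provider, and the class and method names here are invented.

```java
// Hypothetical sketch of the model-name resolution described above: an
// explicit 'model' parameter wins; otherwise the default
// '{lang}-pos-perceptron.bin' and '{lang}-pos-maxent.bin' names are tried.
import java.util.Arrays;
import java.util.List;

public class PosModelNames {

    static List<String> candidateModelNames(String lang, String modelParam) {
        if (modelParam != null && !modelParam.isEmpty()) {
            return Arrays.asList(modelParam);  // explicit 'model' parameter
        }
        return Arrays.asList(
                lang + "-pos-perceptron.bin",
                lang + "-pos-maxent.bin");
    }
}
```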


          People

          • Assignee:
            Unassigned
            Reporter:
            rwesten Rupert Westenthaler
