Details

      Description

      This EnhancementEngine requires Sentences and Tokens with POS annotations to be present in the AnalyzedText content part. It uses this information to create chunks and stores them in the AnalyzedText content part.

        Activity

        rwesten Rupert Westenthaler added a comment -

        with http://svn.apache.org/viewvc?rev=1406170&view=rev the engine supports ServiceProperties

        rwesten Rupert Westenthaler added a comment -

        Documentation for this Engine

        OpenNLP Chunker Engine
        =======

        The OpenNLP Chunker Engine supports the detection of Phrases (Noun, Verb, ...) within the parsed Text. For that it uses the OpenNLP Chunker feature. Detected Phrases are added as _Chunk_s to the [AnalyzedText](../nlp/analyzedtext) content part. In addition, added _Chunk_s are annotated with a [Phrase Annotation](../nlp/nlpannotations#phrase-annotations) providing the type of the Phrase represented by the _Chunk_.

          1. Consumed information
        • _Language_ (required): The language of the text needs to be available. It is read as specified by [STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613) from the metadata of the ContentItem. Effectively this means that a Stanbol Language Detection engine needs to be executed before the OpenNLP Chunker Engine.
        • _Tokens with POS annotations_ (required): This Engine needs the Text to be tokenized and POS tagged. Moreover, the POS tags need to be compatible with the POS tags used to train the Chunker model. This effectively means that this Engine will only work as expected if the POS tagging was done by the OpenNLP POS Tagging Engine configured with a POS model that uses the same POS tag set as the one used for training the chunker model.
        • _Sentences_ (optional): In case _Sentence_s are available in the _AnalyzedText_ content part, the text is processed sentence by sentence. Otherwise the whole text is processed at once.
          1. Configuration

        The OpenNLP Chunker Engine provides a default service instance (configuration policy is optional) that is configured to process all languages. For German the model parameter is set to 'OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip', a chunker model that only detects Noun Phrases. This model is included in the 'o.a.stanbol.data.opennlp.lang.de' module. This Engine instance uses the name 'opennlp-chunker' and has a service ranking of '-100'.

        This engine supports the default configuration for Enhancement Engines, including the _name_ (stanbol.enhancer.engine.name) and the _ranking_ (service.ranking). In addition it is possible to configure the _processed languages_ (org.apache.stanbol.enhancer.chunker.languages) and a parameter to specify the name of the chunker model used for a language.

        _1. Processed Language Configuration:_

        For the configuration of the processed languages the following syntax is used:

        de
        en

        This would configure the Engine to only process German and English texts. It is also possible to explicitly exclude languages:

        !fr
        !it
        *

        This specifies that all languages other than French and Italian are processed.

        Values can be passed as Array or Vector. This is done by using the ["elem1","elem2",...] syntax as defined by OSGi ".config" files. As a fallback, ','-separated Strings are also supported.

        The following example shows the two above examples combined to a single configuration.

        org.apache.stanbol.enhancer.chunker.languages=["!fr","!it","de","en","*"]
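        The accepted value forms (Array, Vector, or a ','-separated String) can be normalised to a single list of entries. The following is only a minimal sketch of that idea: the class `ConfigValueSketch` and the method `toEntries` are hypothetical names chosen for illustration and are not part of the Stanbol code base.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of Stanbol: normalises a configuration
// value that may be an array, a collection (e.g. a Vector) or a
// ','-separated String into a list of trimmed, non-empty entries.
final class ConfigValueSketch {

    static List<String> toEntries(Object value) {
        List<String> entries = new ArrayList<>();
        if (value == null) {
            return entries;
        }
        if (value instanceof String[]) {
            for (String s : (String[]) value) {
                add(entries, s);
            }
        } else if (value instanceof Collection<?>) {
            for (Object o : (Collection<?>) value) {
                add(entries, o == null ? null : o.toString());
            }
        } else {
            // fallback: treat the value as a ','-separated String
            for (String s : value.toString().split(",")) {
                add(entries, s);
            }
        }
        return entries;
    }

    private static void add(List<String> entries, String s) {
        if (s != null && !s.trim().isEmpty()) {
            entries.add(s.trim());
        }
    }

    public static void main(String[] args) {
        System.out.println(toEntries(new String[] {"!fr", "!it", "de", "en", "*"}));
        System.out.println(toEntries("!fr, !it, de, en, *"));
    }
}
```

        Both calls in `main` yield the same five entries, so later processing does not need to care which form was used in the configuration.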

        NOTE that the "processed language" configuration only specifies which languages are considered for processing. If "de" is enabled, but there is no chunker model available for that language, then German text will still not be processed. Conversely, if there is a chunker model for "it" but the "processed language" configuration does not include Italian, then Italian text will NOT be processed.
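        The include/exclude rules described above can be sketched as follows. This is an illustration under stated assumptions, not the Engine's actual implementation: `LanguageFilterSketch` and `isProcessed` are hypothetical names, and the sketch assumes that an explicit exclusion always wins over the '*' wildcard. It only answers whether a language is considered; whether a model is actually available is a separate check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Stanbol: decides whether a language
// is considered for processing based on entries such as "de", "!fr", "*".
final class LanguageFilterSketch {

    private final Set<String> included = new HashSet<>();
    private final Set<String> excluded = new HashSet<>();
    private boolean wildcard = false;

    LanguageFilterSketch(List<String> entries) {
        for (String raw : entries) {
            String entry = raw.trim();
            // drop any parameters that follow the language (";name=value")
            int sep = entry.indexOf(';');
            if (sep >= 0) {
                entry = entry.substring(0, sep).trim();
            }
            if (entry.isEmpty()) {
                continue;
            }
            if (entry.equals("*")) {
                wildcard = true;
            } else if (entry.startsWith("!")) {
                excluded.add(entry.substring(1).trim().toLowerCase(Locale.ROOT));
            } else {
                included.add(entry.toLowerCase(Locale.ROOT));
            }
        }
    }

    boolean isProcessed(String language) {
        String lang = language.toLowerCase(Locale.ROOT);
        if (excluded.contains(lang)) {
            return false; // assumption: explicit exclusion beats the wildcard
        }
        return wildcard || included.contains(lang);
    }

    public static void main(String[] args) {
        LanguageFilterSketch filter = new LanguageFilterSketch(
                Arrays.asList("!fr", "!it", "de", "en", "*"));
        for (String lang : new String[] {"de", "fr", "it", "es"}) {
            System.out.println(lang + " -> " + filter.isProcessed(lang));
        }
    }
}
```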

        _2. Chunker model parameter_

        The OpenNLP Chunker Engine supports the 'model' parameter to explicitly set the name of the chunker model used for a language. Models are loaded via the Stanbol DataFile provider infrastructure. That means that models can be loaded from the {stanbol-working-dir}/stanbol/datafiles folder.

        The syntax for parameters is as follows:

        {language};{param-name}={param-value}

        As shown by the default configuration of this engine, to use "OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip" for chunking German texts one can use a configuration as follows:

        de;model=OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip
        *
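        To make the parameter syntax concrete, here is a small sketch that splits one configuration line into its language and its parameters. `LanguageParamSketch` and `parse` are hypothetical names and not Stanbol API; the sketch also assumes that several parameters could be chained with further ';' separators, which the text above does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Stanbol: parses one line of the form
// {language};{param-name}={param-value}
final class LanguageParamSketch {

    final String language;
    final Map<String, String> params;

    private LanguageParamSketch(String language, Map<String, String> params) {
        this.language = language;
        this.params = params;
    }

    static LanguageParamSketch parse(String line) {
        String[] parts = line.trim().split(";");
        String language = parts[0].trim();
        Map<String, String> params = new LinkedHashMap<>();
        // assumption: additional parameters may follow, each separated by ';'
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq <= 0) {
                continue; // skip entries without a name or without '='
            }
            params.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
        }
        return new LanguageParamSketch(language, params);
    }

    public static void main(String[] args) {
        LanguageParamSketch de = parse("de;model=OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip");
        System.out.println(de.language + " " + de.params);
        LanguageParamSketch any = parse("*");
        System.out.println(any.language + " " + any.params);
    }
}
```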

        By default OpenNLP chunker models are loaded from '{lang}-chunker.bin'. To use models with other names, users need to use the 'model' parameter as described above.
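        The fallback to the default file name can be pictured with the following sketch. `ChunkerModelNameSketch` and `modelName` are hypothetical names used only for illustration; how the Engine resolves the name internally is not shown here.

```java
// Hypothetical helper, not part of Stanbol: picks the model file name
// for a language, preferring an explicitly configured 'model' value.
final class ChunkerModelNameSketch {

    static String modelName(String language, String configuredModel) {
        if (configuredModel != null && !configuredModel.trim().isEmpty()) {
            return configuredModel.trim();
        }
        // default naming pattern: {lang}-chunker.bin
        return language + "-chunker.bin";
    }

    public static void main(String[] args) {
        System.out.println(modelName("en", null));
        System.out.println(modelName("de", "OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip"));
    }
}
```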


          People

          • Assignee:
            Unassigned
            Reporter:
            rwesten Rupert Westenthaler