Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: enhancer-0.10.0
    • Component/s: Enhancer
    • Labels:
      None

      Description

      Because the management of NLP metadata - usually available at word granularity - is not feasible using the RDF metadata, this issue describes the addition of a special ContentPart to Stanbol. This ContentPart will have the name AnalysedText.

      AnalysedText
      =====

      • It wraps the text/plain ContentPart of a ContentItem
      • It allows the definition of Spans (type, start, end, spanText). Type
        is an Enum: Text, TextSection, Sentence, Chunk, Token
      • Spans are sorted naturally by type, start and end. This allows the
        use of a NavigableSet (e.g. TreeSet) and the #subSet() functionality
        to work with contained Tokens. The #higher and #lower methods of
        NavigableSet even allow building Iterators that support concurrent
        modifications (e.g. adding Chunks while iterating over the Tokens of
        a Sentence).
      • One can attach Annotations to Spans. Basically a multi-valued Map
        with Object keys and Value<valueType> value(s) that supports a type
        safe view by using generically typed Annotation<key,valueType>
      • The Value<valueType> object natively supports confidence. This
        allows (e.g. for POS tags) the same instance (e.g. of the POS tag
        for Noun) to be used for all noun annotations.
      • Note that the AnalysedText does NOT use RDF for representing this
        kind of data, as RDF is not scalable enough. This also means that the
        data of the AnalysedText are NOT available in the Enhancement Metadata
        of the ContentItem. However, EnhancementEngines are free to write
        all/some results to the AnalysedText AND the RDF metadata of the
        ContentItem.
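      The ordering and #subSet() behaviour described above can be sketched with plain Java collections. Note that the Span class below is a simplified, hypothetical stand-in for the actual Stanbol interfaces:

      ```java
      import java.util.NavigableSet;
      import java.util.TreeSet;

      public class SpanOrderingSketch {

          //the span types in their natural order
          enum SpanType { Text, TextSection, Sentence, Chunk, Token }

          //simplified stand-in for a Stanbol Span: compared by type, start, end
          static class Span implements Comparable<Span> {
              final SpanType type;
              final int start;
              final int end;
              Span(SpanType type, int start, int end) {
                  this.type = type; this.start = start; this.end = end;
              }
              public int compareTo(Span o) {
                  int c = type.compareTo(o.type);
                  if (c == 0) c = Integer.compare(start, o.start);
                  if (c == 0) c = Integer.compare(end, o.end);
                  return c;
              }
          }

          //count the Tokens contained in the given character range via #subSet()
          static int tokensBetween(NavigableSet<Span> spans, int start, int end) {
              return spans.subSet(
                  new Span(SpanType.Token, start, start), true,
                  new Span(SpanType.Token, end, Integer.MAX_VALUE), true).size();
          }

          public static void main(String[] args) {
              NavigableSet<Span> spans = new TreeSet<Span>();
              spans.add(new Span(SpanType.Sentence, 0, 20));
              spans.add(new Span(SpanType.Token, 0, 5));
              spans.add(new Span(SpanType.Token, 6, 11));
              spans.add(new Span(SpanType.Token, 21, 25)); //in a later sentence
              //only the two Tokens inside the first sentence are counted
              System.out.println(tokensBetween(spans, 0, 20)); //prints 2
          }
      }
      ```

      Because the natural order groups Spans by type first, a range query over the Token type skips all Sentence and Chunk entries without any filtering.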

      Here is some sample code:

      AnalysedText at; //the ContentPart
      Iterator<Sentence> sentences = at.getSentences();
      while(sentences.hasNext()){
          Sentence sentence = sentences.next();
          String sentText = sentence.getSpan();
          Iterator<Token> tokens = sentence.getTokens();
          while(tokens.hasNext()){
              Token token = tokens.next();
              String tokenText = token.getSpan();
              Value<PosTag> pos = token.getAnnotation(
                  NlpAnnotations.POS_ANNOTATION);
              String tag = pos.value().getTag();
              double confidence = pos.probability();
          }
      }

      NLP annotations
      =====

      • TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
        contains Tags of a specific generic type. The Tag only defines a
        String "tag" property
      • Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
        defined. Both also define an optional LexicalCategory. This is an
        enum with the 12 top level concepts defined by the
        [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
        Adjective, Adposition, Adverb ...)
      • TagSets (including mapped LexicalCategories) are defined for all
        languages where POS taggers are available for OpenNLP. This also
        includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" models
        provided by OLiA. The other TagSets used by OpenNLP are currently
        not available from OLiA.
      • Note that the LexicalCategory can be used to process POS annotations
        of different languages

      TagSet:
      https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
      POS:
      https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos

      A code sample:

      TagSet<PosTag> tagSet; //the used TagSet
      Map<String,PosTag> unknown; //tags missing in the TagSet

      Token token; //the token
      String tag; //the detected tag
      double prob; //the probability

      PosTag pos = tagSet.getTag(tag);
      if(pos == null){ //unknown tag
          pos = unknown.get(tag);
      }
      if(pos == null){
          pos = new PosTag(tag); //this tag will not have a LexicalCategory
          unknown.put(tag, pos); //only one instance
      }
      token.addAnnotation(
          NlpAnnotations.POS_ANNOTATION,
          new Value<PosTag>(pos, prob));

        Activity

        rwesten Rupert Westenthaler added a comment -

        Considered to be implemented with http://svn.apache.org/viewvc?rev=1412121&view=rev. Further changes/adaptions should be implemented in their own (more focused) issues.

        rwesten Rupert Westenthaler added a comment -

        Documentation for the NLP Annotations

        NLP Annotations
        ===========

        While the [Analyzed Text](analyzedtext) interface allows the definition of Sentences, Chunks and Tokens within the text and the attachment of annotations to those, this part of the Stanbol NLP processing module defines the Java domain model used for those annotations. This includes annotation models for Part of Speech (POS) tags, Chunks, recognized Named Entities (NER) as well as morphological analysis.

            1. Part of Speech (POS) annotations

        Part of Speech (POS) tagging represents a token level annotation. It assigns tokens to categories like noun, verb, adjective, punctuation ... These annotations are typically provided by a POS tagger that consumes Tokens and provides tag(s) with confidence(s) as output. Tags are usually string values that are members of a TagSet - a fixed list of tags used to annotate tokens. Those tag sets are typically language specific and often even specific to the training corpus. This makes it really hard to consume POS tags created by different POS taggers for different languages, as the consumer would need to know the meanings of all the different POS tags for the different languages.

        The POS annotation model defined by the Stanbol NLP module tries to solve this issue by providing means to align POS tag sets with formal categories defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The following sub-sections provide details and usage examples.

              1.1. OLiA MorphosyntacticCategories

        The '[OLiA](http://nlp2rdf.lod2.eu/olia/) Reference Model for Morphology and Morphosyntax, with experimental extension to Syntax' defines a set of ~150 formally defined and multi-lingual POS tags. Those types are defined as a non-cyclic multi-hierarchy with 'olia:MorphosyntacticCategory' as common root.

        To give an example, the POS tag 'olia:Gerund' is defined as an 'olia:NonFiniteVerb', which itself is an 'olia:Verb'. An example of a multi-hierarchy is 'olia:NominalQuantifier', which is both an 'olia:Noun' and an 'olia:Quantifier'.

        To support a clean integration of the formal definitions of the OLiA ontology within the Stanbol NLP annotations there are two Java enumerations:

        • _LexicalCategory_: This enumeration covers the 12 top level categories as defined by OLiA. This includes Noun, Verb, Adjective, Adposition, Adverb, Conjunction, Interjection, PronounOrDeterminer, Punctuation, Quantifier, Residual and Unique.
        • _Pos_: This enumeration covers all OLiA MorphosyntacticCategories from the 2nd level downwards. By using the Pos enum one can e.g. distinguish between ProperNouns and CommonNouns or FiniteVerbs and NonFiniteVerbs ... The Pos enumeration has full support for the multi-hierarchy as defined by OLiA. The Pos#categories() method allows access to the 1st level parents of a Pos. Pos#hierarchy() returns all 2nd+ level parents of a Pos member.
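        The multi-hierarchy design sketched above can be illustrated with a plain Java enum that carries references to its parent categories. The names below are simplified, hypothetical stand-ins, not the actual Stanbol enumerations:

        ```java
        import java.util.EnumSet;

        public class PosHierarchySketch {

            //stand-in for LexicalCategory: top level OLiA categories
            enum Category { Noun, Verb, Quantifier }

            //stand-in for the Pos enum: each member references its 1st level
            //parents, so members like NominalQuantifier can have several
            enum Pos {
                ProperNoun(Category.Noun),
                NonFiniteVerb(Category.Verb),
                NominalQuantifier(Category.Noun, Category.Quantifier);

                private final EnumSet<Category> categories;

                Pos(Category first, Category... rest) {
                    this.categories = EnumSet.of(first, rest);
                }

                //similar in spirit to the Pos#categories() method described above
                EnumSet<Category> categories() {
                    return categories;
                }
            }

            public static void main(String[] args) {
                //olia:NominalQuantifier is both a Noun and a Quantifier
                System.out.println(Pos.NominalQuantifier.categories());
            }
        }
        ```

        Because the parent links are plain enum members, a consumer can process POS annotations of different languages by switching on the top level category instead of the language specific tag.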
              1.2. PosTag and TagSet

        The PosTag represents a POS tag as used by a POS tagger. PosTags support the following features:

        • _tag_ [1..1]::String - This is the string tag as used by the POS tagger.
        • _category_ [0..*]::LexicalCategory - The assigned LexicalCategory enumeration members.
        • _pos_ [0..*]::Pos - The assigned Pos enumeration members.

        An example of a PosTag representing an 'olia:ProperNoun' looks as follows:

        :::java
        PosTag tag = new PosTag("NP", Pos.ProperNoun);

        The first parameter is the string POS tag used by the POS tagger and the second parameter represents the mapping to the OLiA MorphosyntacticCategories for this tag. The next example shows a more sophisticated mapping for the "PWAV" (Pronominaladverb) tag as used by the STTS tag set for the German language:

        :::java
        new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun, Pos.InterrogativePronoun);

        TagSet is the other important class, as it allows the management of the set of PosTag instances. TagSet has two main functions: First, it allows an integrator of a POS tagger with Stanbol to define the mappings from the string POS tags used by the POS tagger to the LexicalCategory and Pos enumeration members as preferably used by the Stanbol NLP chain. Second, it ensures that only a single PosTag instance is used to annotate all Tokens with the same type.

        _TagSet_s are typically specified as static members of utility classes. The following code snippet shows an example:

        :::java
        //TagSet is generically typed. We need a TagSet for PosTags
        public static final TagSet<PosTag> STTS = new TagSet<PosTag>(
            "STTS", "de"); //define a name and the languages it supports

        static {
            //you can set properties on a TagSet. While supported, this
            //feature is currently not used by Stanbol
            STTS.getProperties().put("olia.annotationModel",
                new UriRef("http://purl.org/olia/stts.owl"));
            STTS.getProperties().put("olia.linkingModel",
                new UriRef("http://purl.org/olia/stts-link.rdf"));
            STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective));
            STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective));
            STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb));
            //[...]
        }

        The string tag (first parameter) of the PosTag is used as unique key by the TagSet. Adding a 2nd PosTag with the same tag will override the first one. PosTags that are added to a TagSet have the Tag#getAnnotationModel() property set to that model.

        The final example shows a code snippet with the core part of a POS tagging engine using both the [AnalyzedText](analyzedtext) and the PosTag and TagSet APIs:

        :::java
        TagSet<PosTag> tagSet; //the used TagSet
        //holds PosTags for tags returned by the POS tagger that
        //are missing in the TagSet
        Map<String,PosTag> adhocTags = new HashMap<String,PosTag>();
        List<Span> tokenList = new ArrayList<Span>(64);

        Iterator<Section> sentences; //Iterator over the sentences

        while(sentences.hasNext()){
            Section sentence = sentences.next();
            //get the tokens of the current sentence
            tokenList.clear();
            AnalysedTextUtils.appendToList(
                sentence.getEnclosed(SpanTypeEnum.Token),
                tokenList);
            //typically one also needs the Strings
            //of the tokens for the POS tagger
            String[] tokenText = new String[tokenList.size()];
            for(int i=0;i<tokenList.size();i++){
                tokenText[i] = tokenList.get(i).getSpan();
            }
            //now POS tag the sentence
            String[] posTags = posTagger.tag(tokenText);

            //finally apply the PosTags and save the annotations
            for(int i=0;i<tokenList.size();i++){
                PosTag tag = tagSet.getTag(posTags[i]);
                if(tag == null){ //unmapped tag
                    tag = adhocTags.get(posTags[i]);
                }
                if(tag == null){ //unknown tag
                    tag = new PosTag(posTags[i]);
                    adhocTags.put(posTags[i], tag);
                }
                //add the annotation to the Token
                tokenList.get(i).addAnnotation(
                    NlpAnnotations.POS_ANNOTATION,
                    Value.value(tag));
            }
        }

            2. Phrase annotations

        Phrase annotations can be used to define the type of a Chunk. The PhraseTag class is used for phrase annotations. It defines first a string tag and second the phrase category. The LexicalCategory enumeration is used as value for the category. As PhraseTag is a subclass of Tag, it can also be used in combination with the TagSet class as described in the [PosTag and TagSet] section.

        The following code snippet shows how to create a PhraseTag for noun phrases:

        :::java
        PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun);

            3. Named Entity (NER) annotations

        Named Entity annotations are created by NER modules. Before the Stanbol NLP chain they were represented in Stanbol by using '[fise:TextAnnotation](../enhancementstructure#fisetextannotation)'s, and any Enhancement Engine that does NER should still support this. With the Stanbol NLP processing module it is now also possible to represent detected Named Entities as a Chunk with a NerTag added as annotation.

        A Named Entity represented as 'fise:TextAnnotation' includes the following information:

        urn:namedEntity:1
            rdf:type fise:TextAnnotation, fise:Enhancement
            fise:selected-text {named-entity-text}
            fise:start {start-char-pos}
            fise:end {end-char-pos}
            dc:type {named-entity-type}

        where:

        • {named-entity-text}

          is the text recognized as Named Entity. This is the same as returned by Chunk#getSpan()

        • {start-char-pos}

          is the start character position of the Named Entity relative to the start of the text. This is the same as Chunk#getStart()

        • {end-char-pos}

          is the end position and the same as Chunk#getEnd()

        • {named-entity-type}

          is the type of the recognized Named Entity as URI. The NerTag allows the definition of both the string tag as used by the NER component as well as the URI this type is mapped to. In Stanbol it is preferred to use 'dbpedia:Person', 'dbpedia:Organisation' and 'dbpedia:Place' for the according entity types.

        The NerTag class extends Tag and can therefore also be used with the TagSet class. This means that users of the API can use a TagSet to manage the string tag to URI mappings for the supported Named Entity types.

        The following code snippet shows how to add NER annotations to the AnalysedText:

        :::java
        AnalysedText at; //The AnalysedText
        TagSet<NerTag> nerTags; //registered NER tags
        Iterator<Section> sections; //sections to iterate over

        List<String> tokenTexts = new ArrayList<String>(64);

        while(sections.hasNext()){
            Section section = sections.next();
            //NER taggers typically need String[] as input
            tokenTexts.clear();
            Iterator<Token> tokens = section.getTokens();
            while(tokens.hasNext()){
                tokenTexts.add(tokens.next().getSpan());
            }
            //Span -> #start #end #type #probability
            Span[] nerSpans = nerTagger.tag(
                tokenTexts.toArray(new String[tokenTexts.size()]));
            for(int i=0; i < nerSpans.length; i++){
                Chunk namedEntity = at.addChunk(
                    nerSpans[i].start, nerSpans[i].end);
                NerTag tag = nerTags.getTag(nerSpans[i].type);
                if(tag == null){ //unmapped NER type
                    tag = new NerTag(nerSpans[i].type);
                }
                namedEntity.addAnnotation(
                    NlpAnnotations.NER_ANNOTATION,
                    Value.value(tag, nerSpans[i].probability));
            }
        }

        Note that the above code snippet only shows how to add the Named Entity to the AnalyzedText ContentPart. An actual NER engine implementation also needs to add this information to the metadata of the [ContentItem](../contentitem).

        :::java
        ContentItem ci; //The processed ContentItem
        Language language; //The Language of the processed Text
        MGraph metadata = ci.getMetadata();
        Section section; //the current Section
        Chunk namedEntity; //the currently processed Named Entity

        Value<NerTag> nerAnnotation = namedEntity.getAnnotation(
            NlpAnnotations.NER_ANNOTATION);

        UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT,
            new PlainLiteralImpl(namedEntity.getSpan(), language)));
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT,
            new PlainLiteralImpl(section.getSpan(), language)));
        if(nerAnnotation.value().getType() != null){
            metadata.add(new TripleImpl(textAnnotation, DC_TYPE,
                nerAnnotation.value().getType()));
        } //else do not add a dc:type for unmapped Named Entities
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE,
            literalFactory.createTypedLiteral(nerAnnotation.probability())));
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_START,
            literalFactory.createTypedLiteral(namedEntity.getStart())));
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_END,
            literalFactory.createTypedLiteral(namedEntity.getEnd())));

            4. Morphological Analyses

        _NOTE:_ This part of the Stanbol NLP annotations is still work in progress, so this part of the API might undergo heavy changes even in minor releases.

        The results of a morphological analysis are represented by the MorphoFeatures class and can be added to the analyzed word (Token) by using the NlpAnnotations.MORPHO_ANNOTATION. The MorphoFeatures class provides the following features:

        • _Lemma_: A String value representing the lemmatization of the annotated Token.
        • _Case_: The Case enumeration contains around 70 members defined based on concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The CaseTag allows the definition of cases and optionally maps them to the cases defined by the enumeration.
        • _Definitness_: The Definitness enumeration has the members Definite and Indefinite, also defined by concepts in the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
        • _Gender_: The Gender enumeration contains the six genders defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The GenderTag allows the definition of genders and optionally maps them to the genders defined by the enumeration.
        • _Number_: The NumberFeature enumeration defines the eight number features defined by [OLiA](http://nlp2rdf.lod2.eu/olia/). The NumberTag can be used to define number features and map them to the members of the enumeration.
        • _Person_: The Person enumeration has the definitions for 'first', 'second' and 'third' with mappings to the corresponding concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
        • _Tense_: The Tense enumeration represents the tense hierarchy as defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). Tense#getParent() allows access to the direct parent of a Tense, while the Tense#getTenses() method can be used to obtain the transitive closure (including the Tense object itself). TenseTag is used for tense annotations. It allows both the parsing of a string tag representing the tense as well as the definition of a mapping to the tenses defined by the Tense enumeration.
        • _Mood_: The VerbMood enumeration currently defines members from different parts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). While OLiA does define the 'olia:MoodFeature' class, its members were not a good match for the verb moods as used by the CELI/linguagrid.org service. For now the decision was to define the VerbMood enumeration more closely to the usage of CELI, but this clearly needs to be validated as soon as implementations for other NLP frameworks are added. There is also a VerbMoodTag that allows the definition of verb moods by a string tag and a mapping to the VerbMood enumeration.

        MorphoFeatures supports multi-valued annotations for all the above features. Getters for a single value will always return the first added value.
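        The "first added value wins" semantics of the single-value getters can be sketched with plain Java. This is a simplified, hypothetical model for illustration, not the actual MorphoFeatures class:

        ```java
        import java.util.ArrayList;
        import java.util.LinkedHashMap;
        import java.util.List;
        import java.util.Map;

        public class MultiValuedFeatureSketch {
            //values per feature keep insertion order, so the single-value
            //getter can simply return the first added value
            private final Map<String, List<String>> features =
                new LinkedHashMap<String, List<String>>();

            public void add(String feature, String value) {
                List<String> values = features.get(feature);
                if (values == null) {
                    values = new ArrayList<String>();
                    features.put(feature, values);
                }
                values.add(value);
            }

            //single-value getter: first added value, or null if absent
            public String get(String feature) {
                List<String> values = features.get(feature);
                return values == null || values.isEmpty() ? null : values.get(0);
            }

            //multi-value getter: all values in insertion order
            public List<String> getAll(String feature) {
                List<String> values = features.get(feature);
                return values == null ? new ArrayList<String>() : values;
            }

            public static void main(String[] args) {
                MultiValuedFeatureSketch morpho = new MultiValuedFeatureSketch();
                morpho.add("lemma", "go");
                morpho.add("lemma", "going"); //alternative analysis
                System.out.println(morpho.get("lemma")); //prints "go"
            }
        }
        ```

        Keeping insertion order makes the single-value view deterministic: the analyzer's preferred (first reported) analysis is what simple consumers see, while consumers that care about alternatives can still retrieve all values.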

        Show
        rwesten Rupert Westenthaler added a comment - Documentation for the NLP Annotations NLP Annotations =========== While the The [Analyzed Text] (analyzedtext) interface allows to define Sentences, Chunks and Tokens within the text and also to attach annotations to those this part of the Stanbol NLP processing module provides the Java domain model for the annotations section this part of the Stanbol NLP processing module defines the Java domain model used for those annotations. This includes annotation models for Part of Speech (POS) tags, Chunks , recognized Named Entities (NER) as well as morphological analysis. Part of Speech (POS) annotations Part of Speech (POS) tagging represents an token level annotation. It assigns tokens with categories like noun, verb, adjectives, punctuation ... This annotations are typically provided by an POS tagger that consumes Tokens and provides tag(s) with confidence(s) as output. Tags are usually string values that are member of a TagSet - a fixed list of tags used to annotate tokens. Those Tag sets are typically language and often even trainings corpus specific. This makes it really hard to consume POS tags created by different POS tagger for different languages as the consumer would need to know about the meanings of all the different POS tags for the different languages. The POS annotation model defined by the Stanbol NLP module tries to solve this issue by providing means to align POS tag sets with formal categories defined by the [OLiA Ontology] ( http://nlp2rdf.lod2.eu/olia/ ). The following sub-section will provide details and usage examples. OLiA MorphosyntacticCategories The ' [OLiA] ( http://nlp2rdf.lod2.eu/olia/ ) Reference Model for Morphology and Morphosyntax, with experimental extension to Syntax' defines a set of ~150 formally defined and multi-lingual POS tags. Those types are defined as a non-cyclic multi-hierarchy with 'oilia:MorphosyntacticCategory' as common root. 
To give an example the POS 'olia:Gerund' is defined as a 'olia:NonFiniteVerb' what itself is a 'olia:Verb'. An example for a multi-hierarchy is 'olia:NominalQuantifier' that is both a 'olia:Noun' and a 'olia:Quantifier'. To allow support a nice integration of the formal definitions by the OLiA ontology within the Stanbol NLP annotations there are two Java enumerations: _ LexicalCategories _: This enumeration covers the 12 top level categories as defined by OLiA. This includes Noun, Verb, Adjective, Adposition, Adverb, Conjuction, Interjection, PronounOrDeterminer, Punctuation, Quantifier, Residual and Unique. _ Pos : This enumeration covers all OLiA MorphosyntacticCategories from the 2+ level. So by using the _Pos enum one can e.g. distinguish between ProperNoun's and CommonNoun's or FiniteVerb's and NonFiniteVerb's ... The Pos enumeration has full support for the multi-hierarchy as defined by OLiA. The Pos#categories() methods allows to get the 1st level parents of Pos . The Pos#hierarchy() returns all 2+ level parents of a Pos member. PosTag and TagSet The PosTag represents a POS tag as used by an POS tagger. PosTags do support the following features: _ tag _ [1..1] ::Stirng - This is the string tag as used by the POS tagger. _ category _ [0..*] ::LexicalCategory - The assigned LexicalCategory enumeration members. _ pos _ [0..*] ::Pos - The assigned Pos enumeration members. An Example for a PosTag representing a 'olia:ProperNoun' looks like follows :::java PosTag tag = new PosTag("NP", Pos.ProperNoun); The first parameter is the String POS tag used by the POS tagger and the second parameter represents the mapping to the OLiA MorphosyntacticCategories for this tag. 
The next example shows an sofisticated mapping for the "PWAV" (Pronominaladverb) as used by the STTS tag set for the German language :::java new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun, Pos.InterrogativePronoun); TagSet is the other important class as it allows to manage the set of PosTag instances. TagSet has two main functions: First it allows an integrator of an POS tagger with Stanbol to define the mappings from the string POS tags used by the Pos Tagger to the LexicalCategory and Pos enumeration members as preferable used by the Stanbol NLP chain. Second it ensures that there is only a single instance of PosTag used to annotate all Tokens with the same type. _TagSet_s are typically specified as static members of utility classes. The following code snippet shows an example :::java //Tagset is generically typed. We need a TagSet for PosTag's public static final TagSet<PosTag> STTS = new TagSet<PosTag>( "STTS", "de"); //define a name and the languages it supports static { //you can set properties to a TagSet. While supported this //feature is currently not used by Stanbol STTS.getProperties().put("olia.annotationModel", new UriRef("http://purl.org/olia/stts.owl")); STTS.getProperties().put("olia.linkingModel", new UriRef("http://purl.org/olia/stts-link.rdf")); STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective)); STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective)); STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb)); //[...] } The string tag (first parameter) of the PosTag is used as unique key by the TagSet . Adding an 2nd PasTag with the same tag will override the first one. PosTag_s that are added to a _TagSet have the Tag#getAnnotationModel() property set to that model. The final example shows a code snippet shows the core part of an POS tagging engine using the both the [AnalyzedText] (analyzedtext) and the PosTag and TagSet APIs. 
:::java TagSet<PosTag> tagSet; //the used TagSet //holds PosTags for tags returned by the POS tagger that //are missing in the TagSet Map<String,PosTag> adhocTags = new HashMap<String,PosTag>(): List<Span> token = new ArrayList<Span>(64); Iterator<Section> sentences; //Iterator over the sentences while(sentences.hasNext()){ Section sentence = sentences.next(); //get the tokens of the current sentence token.clean(); AnalysedTextUtils.appandToList( sentence.getEnclosed(SpanTypeEnum.Token), tokenList); //typically one needs also to get the Strings //of the tokens for the pos tagger String[] tokenText = new String [tokenList.size()] ; for(int i=0;i<tokens.size();i++) { tokenText[i] = tokens.get(i).getSpan(); } //now POS tag the sentence String[] posTags = posTagger.tag(tokens); //finally apply the PosTags and save the annotation for(int i=0;i<tokens.size();i++){ PosTag tag = tagSet.get(posTags [i] ); if(tag == null) { //unmapped tag tag = adhocTags.get(posTags[i]); } if(tag == null) { //unknown tag tag = new PosTag(posTags[i]); adhocTags.put(posTags[i],tag); } //add the annotation to the Token token.addAnnotation( NlpAnnotations.POS_ANNOTATION, Value.value(tag)); } } Phrase annotations Phrase annotations can be used to define the type of a Chunk . The PhraseTag class is used for phrase annotations. It defines first a string tag and secondly the Phrase category. The LexicalCategory enumeration is used as valued for the category. As the PhraseTag is a subclass of Tag it can be also used in combination with the TagSet class as described in the [PosTag and TagSet] section. The following code snippets show how to create a PhraseTag for noun phrases :::java PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun); Name Entity (NER) annotations Named Entity annotations are created by NER modules. 
Before the Stanbol NLP chain they where represented in Stanbol by using ' [fise:TextAnnotation] (../enhancementstructure#fisetextannotation)'s and any Enhancement Engine that does NER should still support this. With the Stanbol NLP processing module it is now also possible to represent detected Named Entities as Chunk with an PhraseTag added as Annotation. A Named Entity represented as 'fise:TextAnnotation' includes the following information: urn:namedEntity:1 rdf:type fise:TextAnnotation, fise:Enhancement fise:selected-text {named-entity-text} fise:start {start-char-pos} fise:end {end-char-pos} dc:type {named-entity-type} where: * {named-entity-text} is the text recognized as Named Entity. This is the same as returned by Chunk#getSpan() {start-char-pos} is the start character position of the Named Entity relative to the start of the text. This is the same as Chunk#getStart() {end-char-pos} is the end position and the same as Chunk#getEnd() {named-enttiy-type} is the type of the recognized Named Entity as URI. The _PhraseTag allows to define both the string tag as used by the NER component as well as the URI this type is mapped to. In Stanbol it is preferred to use 'dbpedia:Person', 'dbpedia:Organisation' and 'dbpedia:Place' for the according entity types. The NerTag class extends Tag and can therefore be also used with the TagSet class. This means that users of the API can use TagSet to manage the string tag to URI mappings for the supported Named Entity types. 
The following Code Snippets shows how to add NER annotations to the AnalysedText: :::java AnalysedText at; //The AnalysedText TagSet<NerTag> nerTags; //registered NER tags Iterator<Section> sections; //sections to iterate over List<String> tokenTexts = new ArrayList<Span>(64); while(sections.hasNext()){ Section section = sections.next(); //NER tagger typically need String[] as input token.clean(); Iterator<Token> tokens = section.getTokens; while(tokens.hasNext()) { tokenTexts.add(tokens.next().getSpan()); } //Span -> #start #end #type #probability Span[] nerSpans = nerTagger.tag( tokenTexts.toArray(new String [tokenTexts.size()] ); for(int i=0; i < nerSpans.length; i++){ Chunk namedEntity = at.addChunk( nerSpans [i] .start,nerSpans [i] .start); NerTag tag = nerTags.get(nerSpans [i] .type) if(tag == null) { //unmapped NER tag = new NerTag(nerSpans[i].type); } namedEntity.addAnnotation( NlpAnnotations.NER_ANNOTATION, Value.value(tag, nerSpans [i] . probability)); } } Note that the above Code Snippet only shows how to add the Named Entity to the AnalyzedText ContentPart. A actual NER engine Implementation needs also to add those information to the metadata of the [ContentItem] (../contentitem). 
:::java ContentItem ci; //The processed ContentItem Language lang; //The Language of the processed Text MGraph metadata = ci.getMetadata(); Section section; //the current Section Chunk namedEntity //the currently processed Named Entity Value<NerTag> nerAnnotation = namedEntity.getAnnotation( NlpAnnotations.NER_ANNOTATION); UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this); metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT, new PlainLiteralImpl(namedEntity.getSpan(), language))); metadata.add.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT, new PlainLiteralImpl(section.getSpan(), language))); if(tag.getType() != null) { metadata.add(new TripleImpl(textAnnotation, DC_TYPE, nerAnnotation.value().getType)); } //else do not add an dc:type for unmapped NamedEntities g.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE, literalFactory.createTypedLiteral(nerAnnotation.probability()))); g.add(new TripleImpl(textAnnotation, ENHANCER_START, literalFactory.createTypedLiteral(namedEntity.getStart())); g.add(new TripleImpl(textAnnotation, ENHANCER_END, literalFactory.createTypedLiteral(namedEntity.getEnd()))); Morphological Analyses _ NOTE: _ This part of the Stanbol NLP annotations is still work in progress. So this part of the API might undergo heavy changes even in minor releases. The results of a Morphological Analyses are represented by the MorphoFeatures class and can be added to the analyzed word ( Token ) by using the NlpAnnotations.MORPHO_ANNOTATION . The MorphoFeatures class provides the following features: _ Lemma _: A String value representing the lemmatization of the annotated Token. _ Case : The _Case enumeration contains around 70 members defined based on concepts of the [OLiA Ontology] ( http://nlp2rdf.lod2.eu/olia/ ). The CaseTag allows to define cases and optionally map them to the cases defined by the enumeration. 
• *Definitness*: The Definitness enumeration has the members Definite and Indefinite, also defined by concepts in the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
• *Gender*: The Gender enumeration contains the six genders defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The GenderTag allows to define genders and optionally map them to the genders defined by the enumeration.
• *Number*: The NumberFeature enumeration defines the eight number features defined by [OLiA](http://nlp2rdf.lod2.eu/olia/). The NumberTag can be used to define number features and map them to the members of the enumeration.
• *Person*: The Person enumeration has the definitions for 'first', 'second' and 'third' with mappings to the according concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
• *Tense*: The Tense enumeration represents the tense hierarchy as defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). Tense#getParent() allows access to the direct parent of a Tense while the Tense#getTenses() method can be used to obtain the transitive closure (including the Tense object itself). TenseTag is used for Tense annotations. It allows both to parse a string tag representing the tense as well as defining a mapping to the tenses defined by the Tense enumeration.
• *Mood*: The VerbMood enumeration currently defines members from different parts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). While OLiA does define the 'olia:MoodFeature' class, its members did not match well with the verb moods as used by the CELI/linguagrid.org service. For now the decision was to define the VerbMood enumeration more closely to the usage of CELI, but this clearly needs to be validated as soon as implementations for other NLP frameworks are added. There is also a VerbMoodTag that allows to define verb moods by a string tag and a mapping to the VerbMood enumeration.

The MorphoFeatures class supports multi-valued annotations for all the above features. Getters for a single value will always return the first added value.
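
The multi-valued feature storage with a first-added single-value getter can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual MorphoFeatures class; the class name MorphoSketch and its members are hypothetical, and only the Lemma feature is shown.

```java
import java.util.ArrayList;
import java.util.List;

public class MorphoSketch {
    // Minimal stand-in for MorphoFeatures: every feature is multi-valued,
    // and the single-value getter returns the first added value
    private final List<String> lemmas = new ArrayList<>();

    public void addLemma(String lemma) {
        lemmas.add(lemma); //insertion order is preserved
    }

    public String getLemma() { //single-value getter: first added value
        return lemmas.isEmpty() ? null : lemmas.get(0);
    }

    public List<String> getLemmas() { //multi-value getter
        return lemmas;
    }

    public static void main(String[] args) {
        MorphoSketch mf = new MorphoSketch();
        mf.addLemma("run");  //added first
        mf.addLemma("runs"); //alternative analysis
        System.out.println(mf.getLemma()); // run
    }
}
```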
        rwesten Rupert Westenthaler added a comment -

        Documentation for the in-memory implementation of the AnalyzedText

        In-Memory AnalyzedText and Annotation implementation
        ================

        This describes the implementation of the [Analyzed Text](analysedtext) used by default by the Stanbol NLP processing module. This implementation is directly contained within the org.apache.stanbol.enhancer.nlp module.

          1. AnalyzedTextFactory

        The AnalyzedTextFactory of the in-memory implementation registers itself as an OSGi service with a "service.ranking" of Integer.MIN_VALUE. That means that any other registered AnalyzedTextFactory will override this one (unless it also registers with Integer.MIN_VALUE).
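
The effect of the minimal service ranking can be illustrated with a small sketch. The Registration record and select() helper below are hypothetical stand-ins, not OSGi or Stanbol API: they only model the rule that the highest service.ranking wins, so a factory registered with Integer.MIN_VALUE loses to any other registration (real OSGi additionally breaks ties by lowest service.id, which is omitted here).

```java
import java.util.Comparator;
import java.util.List;

public class ServiceRankingDemo {
    // Hypothetical stand-in for an OSGi service registration
    record Registration(String name, int ranking) {}

    // OSGi-style selection: the highest service.ranking wins
    static Registration select(List<Registration> regs) {
        return regs.stream()
            .max(Comparator.comparingInt(Registration::ranking))
            .orElse(null);
    }

    public static void main(String[] args) {
        // the in-memory factory registers with Integer.MIN_VALUE, so any
        // other factory (default ranking 0) overrides it
        List<Registration> regs = List.of(
            new Registration("in-memory", Integer.MIN_VALUE),
            new Registration("custom", 0));
        System.out.println(select(regs).name()); // custom
    }
}
```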

        The implementation uses the ContentItemHelper#getText(Blob blob) method to retrieve the text from the parsed blob. The text is then used to create an AnalyzedText instance.

          1. AnalyzedText Implementation

        The in-memory implementation is based on a NavigableMap that uses the same span as both key and value. TreeMap is currently used as implementation. The compareTo(..) method of the Span implementation ensures the correct ordering of Spans as specified by the [Analyzed Text](analyzedtext) interface. All add*(..) methods first check if a span with the added type and [start,end) is already contained. If this is the case the existing span is returned, otherwise a new instance is created.

        The Iterator implementation is not based on the Iterators provided by the NavigableMap, as those would throw ConcurrentModificationExceptions - which is prohibited by the specification. Instead an implementation based on the #higherKey() method is used. Filtered Iterators are implemented using the Apache Commons Collections FilteredIterator utility with a Predicate based on the SpanTypeEnum.
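
How such a modification-tolerant iterator can be built on top of NavigableMap#higherKey() can be sketched as follows. TolerantKeyIterator is an illustrative name, not the actual Stanbol class: instead of holding a fail-fast map iterator, it remembers the last returned key and asks the map for the next higher key on each step, so entries added during iteration are simply picked up.

```java
import java.util.Iterator;
import java.util.List;
import java.util.ArrayList;
import java.util.NavigableMap;
import java.util.NoSuchElementException;
import java.util.TreeMap;

public class HigherKeyIteratorDemo {
    // Iterates a NavigableMap's keys via higherKey() instead of a
    // fail-fast iterator, tolerating concurrent modifications
    static class TolerantKeyIterator<K> implements Iterator<K> {
        private final NavigableMap<K, ?> map;
        private K last; //null before the first call to next()

        TolerantKeyIterator(NavigableMap<K, ?> map) { this.map = map; }

        public boolean hasNext() {
            return last == null ? map.firstEntry() != null
                                : map.higherKey(last) != null;
        }

        public K next() {
            K next = (last == null) ? map.firstKey() : map.higherKey(last);
            if (next == null) throw new NoSuchElementException();
            return last = next;
        }
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> spans = new TreeMap<>();
        spans.put(0, "Sentence");
        spans.put(10, "Token");
        Iterator<Integer> it = new TolerantKeyIterator<>(spans);
        List<Integer> seen = new ArrayList<>();
        while (it.hasNext()) {
            Integer key = it.next();
            seen.add(key);
            //add an entry while iterating: no ConcurrentModificationException
            if (key == 0) spans.put(5, "Chunk");
        }
        System.out.println(seen); // [0, 5, 10]
    }
}
```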

          1. Annotation Implementation

        The implementation of the Annotated interface is similar to that of the SolrInputDocument. Internally it uses a Map<Object,Object> to store data. When a single value is added it is directly stored in the map. In the case of multiple values the data are stored in arrays. Arrays are sorted by a comparator that ensures that the value with the highest probability is at index '0'.
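
The storage scheme can be sketched as follows. AnnotatedSketch and its Val record are hypothetical stand-ins for the actual Annotated and Value implementations: a single value is stored directly in the map, a second value promotes the entry to an array sorted so that the highest probability sits at index 0.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class AnnotatedSketch {
    // Hypothetical stand-in for Value: payload plus probability (-1 = unknown)
    record Val(Object value, double probability) {}

    private final Map<Object, Object> data = new HashMap<>();

    // Highest probability first; -1 (unknown) naturally sorts last
    private static final Comparator<Val> BY_PROB =
        Comparator.comparingDouble(Val::probability).reversed();

    public void add(Object key, Val v) {
        Object current = data.get(key);
        if (current == null) {
            data.put(key, v); //single value stored directly
        } else { //promote to (or grow) a sorted array
            Val[] old = current instanceof Val single
                ? new Val[]{single} : (Val[]) current;
            Val[] vals = Arrays.copyOf(old, old.length + 1);
            vals[old.length] = v;
            Arrays.sort(vals, BY_PROB); //highest probability at index 0
            data.put(key, vals);
        }
    }

    // Single-value getter: the value with the highest probability
    public Val get(Object key) {
        Object o = data.get(key);
        return o == null ? null : (o instanceof Val v ? v : ((Val[]) o)[0]);
    }

    public static void main(String[] args) {
        AnnotatedSketch annotated = new AnnotatedSketch();
        annotated.add("pos", new Val("N", 0.3));
        annotated.add("pos", new Val("V", 0.9));
        System.out.println(annotated.get("pos").value()); // V
    }
}
```

Note that, as in the described implementation, nothing prevents mixing value types under one key; a mismatched cast only surfaces at runtime.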

        Type safety is not checked so creating multiple Annotations with different value types that share the same key will cause ClassCastExceptions at runtime.

        rwesten Rupert Westenthaler added a comment -

        Documentation for Analyzed Text

        AnalysedText
        =====

        The AnalysedText is a Java domain model designed to describe NLP processing results. It consists of two major parts:

        1. Structure of the Text such as text-sections, sentences, chunks and tokens
        2. Annotations for the detected parts of the text.

          1. AnalysedText as ContentPart

        Within the Stanbol Enhancer the AnalysedText is used as [ContentPart](../contentitem#content-parts) registered with the URI <code>urn:stanbol.enhancer:nlp.analysedText</code>

        Because of that it can be retrieved by using the following code

        :::java
        AnalysedText at;
        ci.getLock().readLock().lock();
        try {
            at = ci.getPart(AnalysedText.ANALYSED_TEXT_URI, AnalysedText.class);
        } catch (NoSuchPartException e) {
            //not present
            at = null;
        } finally {
            ci.getLock().readLock().unlock();
        }

        Components that need to create an AnalysedText instance can do so by using the AnalysedTextFactory

        :::java
        @Reference
        AnalysedTextFactory atf;

        ContentItem ci; //the contentItem
        AnalysedText at;
        Entry<String,Blob> plainTextBlob = ContentItemHelper.getBlob(
            ci, Collections.singleton("text/plain"));
        if(plainTextBlob != null){
            //creates and adds the AnalysedText ContentPart to the ContentItem
            ci.getLock().writeLock().lock();
            try {
                at = atf.createAnalysedText(ci, plainTextBlob.getValue());
            } finally {
                ci.getLock().writeLock().unlock();
            }
        } else { //no NLP processing possible
            at = null;
        }

        If used outside of OSGi, users can also use AnalysedTextFactory#getDefaultInstance() to obtain the AnalysedTextFactory instance of the in-memory implementation.

          1. Structure of the Text

        The basic building block of the AnalysedText is the Span. A Span defines type, [start,end) as well as the spanText. For the type an enumeration (SpanTypeEnum) with the members Text, TextSection, Sentence, Chunk and Token is used. [start,end) defines the character positions of the Span within the Text, where the start position is inclusive and the end position is exclusive.
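
The [start,end) convention matches the semantics of String#substring(start, end), which a tiny example can confirm (the text and indexes below are illustrative):

```java
public class SpanIndexDemo {
    public static void main(String[] args) {
        String text = "Hello Stanbol";
        //a Span [6,13) over this text selects "Stanbol": the start index
        //is inclusive, the end index exclusive - exactly like substring
        int start = 6, end = 13;
        System.out.println(text.substring(start, end)); // Stanbol
    }
}
```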

        Analogous to the type of the Span there are also Java interfaces representing those types and providing additional convenience methods. An additional Section interface was introduced as common parent for all types that may have enclosed Spans. The AnalyzedText is the interface representing SpanTypeEnum#Text. The main intention of those Java interfaces is to provide convenience methods that ease the use of the API.

            1. Uniqueness of Spans

        A Span is considered equal to another Span if [start, end) and type are the same. The natural order of Spans is defined by:

        • smaller start index first
        • bigger end index first
        • higher ordinal number of the SpanTypeEnum first

        This order is used by all Iterators returned by the AnalyzedText API.
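
The three ordering rules above can be expressed as a plain Comparator. The Span record below is a simplified stand-in (type ordinal plus [start,end)), not the actual Stanbol interface; it only serves to show how the rules compose:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SpanOrderDemo {
    // Simplified stand-in for Span: SpanTypeEnum ordinal plus [start,end)
    record Span(int typeOrdinal, int start, int end) {}

    // Natural order: smaller start first, bigger end first,
    // higher SpanTypeEnum ordinal first
    static final Comparator<Span> NATURAL_ORDER =
        Comparator.comparingInt(Span::start)
            .thenComparing(Comparator.comparingInt(Span::end).reversed())
            .thenComparing(Comparator.comparingInt(Span::typeOrdinal).reversed());

    public static void main(String[] args) {
        //a sentence-like span [0,20) sorts before spans with the same start
        //but smaller end; for identical [start,end) the higher type ordinal
        //(the more specific span type) sorts first
        List<Span> spans = new ArrayList<>(List.of(
            new Span(4, 0, 10),   //token-like span
            new Span(2, 0, 20),   //sentence-like span
            new Span(3, 0, 10))); //chunk-like span
        spans.sort(NATURAL_ORDER);
        System.out.println(spans);
    }
}
```

This ordering is what makes enclosing spans precede their contained spans, so a NavigableSet#subSet() over it yields exactly the spans inside a Section.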

            1. Concurrent Modifications and Iterators

        Iterators returned by the AnalyzedText API MUST NOT throw _ConcurrentModificationException_s but rather reflect changes to the underlying model. While this is not consistent with the default behavior of Iterators in Java, it is central for the effective usage of the AnalyzedText API - e.g. when iterating over Sentences while adding Tokens.

            1. Code Samples:

        The following Code Snippet shows some typical usages of the API:

        :::java
        AnalysedText at; //typically retrieved from the contentPart
        Iterator<Sentence> sentences = at.getSentences();
        while(sentences.hasNext()){
            Sentence sentence = sentences.next();
            String sentText = sentence.getSpan();
            Iterator<Token> tokens = sentence.getTokens();
            while(tokens.hasNext()){
                Token token = tokens.next();
                String tokenText = token.getSpan();
                Value<PosTag> pos = token.getAnnotation(
                    NlpAnnotations.POS_ANNOTATION);
                String tag = pos.value().getTag();
                double confidence = pos.probability();
            }
        }

        Code that adds new Spans looks as follows:

        :::java
        //Tokenize a Text
        Iterator<Sentence> sentences = at.getSentences();
        Iterator<? extends Section> sections;
        if(sentences.hasNext()){ //sentence annotations present
            sections = sentences;
        } else { //if no sentences tokenize the text at once
            sections = Collections.singleton(at).iterator();
        }
        //Tokenize the sections
        while(sections.hasNext()){
            Section section = sections.next();
            //assuming the Tokenizer returns tokens as 2dim int array
            int[][] tokenSpans = tokenizer.tokenize(section.getSpan());
            for(int ti = 0; ti < tokenSpans.length; ti++){
                Token token = section.addToken(
                    tokenSpans[ti][0], tokenSpans[ti][1]);
            }
        }

        For all #add*(start,end) methods in the API the parsed start and end indexes are relative to the parent (the Span the #add*(..) method is called on). The [start,end) indexes returned by Spans are absolute values. If an #add*(..) method is called for a Span '[start,end):type' that already exists, the existing instance is returned instead of a new one.
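
The relative-in, absolute-out convention can be sketched with a minimal stand-in (the Section record and its addToken method below are illustrative, not the Stanbol API): a child is added with offsets relative to its parent's start, but reports absolute indexes.

```java
public class RelativeOffsetDemo {
    // Minimal stand-in for a Section: absolute [start,end) plus a child
    // factory whose arguments are relative to this section's start
    record Section(int start, int end) {
        Section addToken(int relStart, int relEnd) {
            //parsed indexes are relative to this parent; the returned
            //span reports absolute character positions
            return new Section(start + relStart, start + relEnd);
        }
    }

    public static void main(String[] args) {
        Section sentence = new Section(100, 140);
        Section token = sentence.addToken(0, 5);
        //the first five characters of the sentence, reported as
        //absolute indexes [100,105)
        System.out.println(token.start() + "," + token.end()); // 100,105
    }
}
```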

          1. Annotation Support

        Annotation support is provided by the two interfaces Annotated and Annotation and the Value class. Annotated provides an API for adding information to the annotated object. Those annotations are represented by key value mappings where Object is used as key and the Value class for values. The Value class provides the generically typed value as well as a double probability in the range [0..1], or -1 if not known. Finally the Annotation class is used to ensure type safety.

        The following example shows the intended usage of the API

        1. One needs to define the Annotations one would like to use. Annotations are typically defined as public static members of interfaces or classes. The following example uses the definition of the Part of Speech annotation.

        :::java
        public interface NlpAnnotations {
            //a Part of Speech Annotation using a String key
            //and the PosTag class as value
            Annotation<String,PosTag> POS_ANNOTATION =
                new Annotation<String,PosTag>(
                    "stanbol.enhancer.nlp.pos", PosTag.class);
            ...
        }

        2. Defined Annotations are used to add information to an Annotated instance (like a Span). For adding annotations the use of _Annotation_s is required to ensure type safety. The following code snippet shows how to add a PosTag with the probability 0.95.

        :::java
        PosTag tag = new PosTag("N"); //a simple POS tag
        Token token; //The Token we want to add the tag to
        token.addAnnotation(POS_ANNOTATION, Value.value(tag, 0.95));

        3. For consuming annotations there are two options: first, using the Annotation object, and second, directly using the key. While the 2nd option is not as nice to use (as it does not provide type safety) it allows consuming annotations without the need to have the used Annotation in the classpath. The following examples show both options:

        :::java
        Iterator<Token> tokens = sentence.getTokens();
        while(tokens.hasNext()){
            Token token = tokens.next();
            //(1) use the POS_ANNOTATION to get the PosTag
            Value<PosTag> tag = token.getAnnotation(POS_ANNOTATION);
            if(tag != null){
                log.info("{} has PosTag {}", token, tag.value());
            } else {
                log.info("{} has no PosTag", token);
            }
            //(2) use the key to retrieve values
            String key = "urn:test-dummy";
            Value<?> value = token.getValue(key);
            //the programmer needs to know the type!
            if(value != null && value.probability() > 0.5){
                log.info("{}={}", key, value.value());
            }
        }

        The Annotated interface supports multi-valued annotations. For that it defines methods for adding/setting and getting multiple values. Values are sorted first by probability (unknown probability last) and second by insert order (first in, first out). So calling the single value getAnnotation() method on a multi-valued field will return the first item (highest probability, and first added in case of multiple items with the same or no probability).
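
The described value ordering can be reproduced with a stable sort and a probability comparator. The Val record and sortValues() helper below are illustrative stand-ins, not the actual Stanbol implementation: probability descends, -1 (unknown) naturally sorts last, and because List#sort is stable, values with equal probability keep their insertion order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ValueOrderDemo {
    // Hypothetical stand-in for Value: payload plus probability (-1 = unknown)
    record Val(String value, double probability) {}

    // Probability descending; -1 (unknown) sorts last; List#sort is
    // stable, so equal probabilities preserve insertion order
    static void sortValues(List<Val> values) {
        values.sort(Comparator.comparingDouble(Val::probability).reversed());
    }

    public static void main(String[] args) {
        List<Val> values = new ArrayList<>(List.of(
            new Val("first-added", 0.5),
            new Val("unknown", -1),
            new Val("second-added", 0.5),
            new Val("best", 0.9)));
        sortValues(values);
        //the single value getter would return values.get(0), i.e. "best"
        for (Val v : values) {
            System.out.println(v.value());
        }
    }
}
```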


          People

          • Assignee:
            rwesten Rupert Westenthaler
            Reporter:
            rwesten Rupert Westenthaler
