• Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: enhancer-0.10.0
    • Component/s: Enhancer
    • Labels:


      NLP metadata is usually available at word granularity, and managing it through the RDF metadata is not feasible. This issue therefore describes the addition of a special ContentPart to Stanbol. This ContentPart will have the name AnalysedText.


      • It wraps the text/plain ContentPart of a ContentItem
      • It allows the definition of Spans (type, start, end, spanText). Type
        is an Enum: Text, TextSection, Sentence, Chunk, Span
      • Spans are sorted naturally by type, start and end. This makes it
        possible to use a NavigableSet (e.g. TreeSet) and its #subSet()
        functionality to work with contained Tokens. The #higher and #lower
        methods of NavigableSet even make it possible to build Iterators that
        tolerate concurrent modifications (e.g. adding Chunks while iterating
        over the Tokens of a
      • One can attach Annotations to Spans. Basically this is a multi-valued
        Map with Object keys and Value<valueType> value(s), which supports a
        type-safe view through generically typed Annotation<key,valueType>
      • The Value<valueType> object natively supports confidence. Because
        the confidence is stored on the Value and not on the tag, the same
        instance (e.g. of the POS tag for Noun) can be used for all noun
        annotations.
      • Note that the AnalysedText does NOT use RDF to represent this
        kind of data, as RDF is not scalable enough. This also means that the
        data of the AnalysedText are NOT available in the Enhancement Metadata
        of the ContentItem. However, EnhancementEngines are free to write
        all/some results to the AnalysedText AND the RDF metadata of the
      Here is some sample code:

      AnalysedText at; //the contentPart
      Iterator<Sentence> sentences = at.getSentences();
      Sentence sentence = sentences.next(); //the first sentence
      String sentText = sentence.getSpan();
      Iterator<Token> tokens = sentence.getTokens();
      while(tokens.hasNext()) {
          Token token = tokens.next();
          String tokenText = token.getSpan();
          Value<PosTag> pos = token.getAnnotation(NlpAnnotations.posAnnotation);
          String tag = pos.value().getTag();
          double confidence = pos.probability();
      }
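
      How a typed annotation key can give a type-safe view on an otherwise
      untyped multi-valued map, and why keeping the confidence on the Value
      lets one tag instance be shared, might look roughly like this. The
      classes below are simplified stand-ins written for this example; their
      names follow the description, but the constructors, the Annotated
      class and the POS_ANNOTATION constant are assumptions and may differ
      from the actual Stanbol API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnnotationSketch {

    /** Typed annotation key: V fixes the type of the values stored under the key. */
    static final class Annotation<K, V> {
        final K key;
        final Class<V> valueType;

        Annotation(K key, Class<V> valueType) {
            this.key = key;
            this.valueType = valueType;
        }
    }

    /** A value plus the confidence of this particular assignment. */
    static final class Value<V> {
        private final V value;
        private final double probability;

        Value(V value, double probability) {
            this.value = value;
            this.probability = probability;
        }

        V value() { return value; }

        double probability() { return probability; }
    }

    /** Immutable tag, so one instance can be shared by many annotations. */
    static final class PosTag {
        private final String tag;

        PosTag(String tag) { this.tag = tag; }

        String getTag() { return tag; }
    }

    /** Anything that can carry annotations, e.g. a Token (hypothetical base class). */
    static class Annotated {
        // Untyped storage: Object keys, multiple values per key.
        private final Map<Object, List<Value<?>>> annotations = new HashMap<>();

        <K, V> void addAnnotation(Annotation<K, V> annotation, Value<V> value) {
            annotations.computeIfAbsent(annotation.key, k -> new ArrayList<>()).add(value);
        }

        /** Returns the first value for the key, or null if none is present. */
        <K, V> Value<V> getAnnotation(Annotation<K, V> annotation) {
            List<Value<V>> values = getAnnotations(annotation);
            return values.isEmpty() ? null : values.get(0);
        }

        /** Returns all values for the key, checked against the key's value type. */
        <K, V> List<Value<V>> getAnnotations(Annotation<K, V> annotation) {
            List<Value<V>> typed = new ArrayList<>();
            for (Value<?> v : annotations.getOrDefault(annotation.key, new ArrayList<>())) {
                // Class.cast throws ClassCastException if a foreign type was stored.
                V checked = annotation.valueType.cast(v.value());
                typed.add(new Value<>(checked, v.probability()));
            }
            return typed;
        }
    }

    /** Hypothetical key for POS annotations. */
    static final Annotation<String, PosTag> POS_ANNOTATION =
            new Annotation<>("pos", PosTag.class);

    public static void main(String[] args) {
        PosTag noun = new PosTag("NN"); // single shared tag instance
        Annotated first = new Annotated();
        Annotated second = new Annotated();
        first.addAnnotation(POS_ANNOTATION, new Value<>(noun, 0.97));
        second.addAnnotation(POS_ANNOTATION, new Value<>(noun, 0.62));
        Value<PosTag> pos = second.getAnnotation(POS_ANNOTATION);
        System.out.println(pos.value().getTag() + " " + pos.probability());
    }
}
```

      Both annotated objects reference the same PosTag instance; only the
      small Value wrapper with the probability is created per annotation.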


      NLP annotations

      • TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
        contains Tags of a specific generic type. The Tag only defines a
        String "tag" property
      • Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
        defined. Both also define an optional LexicalCategory. This is an enum
        with the 12 top level concepts defined by the
        OLIA ontology (e.g. Noun, Verb,
        Adjective, Adposition, Adverb ...)
      • TagSets (including mapped LexicalCategories) are defined for all
        languages where POS taggers are available for OpenNLP. This also
        includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
        OLIA. The other TagSets used by OpenNLP are currently not available by
      • Note that the LexicalCategory can be used to process POS annotations
        of different languages in the same way, independent of the
        language-specific tag
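
      The following sketch illustrates how a LexicalCategory lets code
      process POS tags from different languages uniformly. It is an
      illustration under assumptions: the TagSet here is simplified to a
      non-generic class, the enum lists only the five categories named
      above, and the two small tag sets contain just a few tags from the
      Penn Treebank (English) and STTS (German) tag sets.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LexicalCategorySketch {

    /** Subset of the top level categories named in the description. */
    enum LexicalCategory { Noun, Verb, Adjective, Adposition, Adverb }

    /** POS tag with an optional language-independent category. */
    static final class PosTag {
        final String tag;
        final LexicalCategory category; // null when the tag is not mapped

        PosTag(String tag) { this(tag, null); }

        PosTag(String tag, LexicalCategory category) {
            this.tag = tag;
            this.category = category;
        }
    }

    /** Tags valid for one or more languages, looked up by their string form. */
    static final class TagSet {
        final Set<String> languages;
        private final Map<String, PosTag> tags = new HashMap<>();

        TagSet(String... languages) {
            this.languages = new HashSet<>(Arrays.asList(languages));
        }

        TagSet add(PosTag tag) {
            tags.put(tag.tag, tag);
            return this;
        }

        /** Returns the tag, or null if this tag set does not define it. */
        PosTag getTag(String tag) {
            return tags.get(tag);
        }
    }

    /** A few Penn Treebank tags (English). */
    static final TagSet PENN = new TagSet("en")
            .add(new PosTag("NN", LexicalCategory.Noun))
            .add(new PosTag("VBZ", LexicalCategory.Verb));

    /** A few STTS tags (German). */
    static final TagSet STTS = new TagSet("de")
            .add(new PosTag("NN", LexicalCategory.Noun))
            .add(new PosTag("NE", LexicalCategory.Noun))
            .add(new PosTag("VVFIN", LexicalCategory.Verb));

    /** Counts tags of a category without knowing which tag set produced them. */
    static long count(List<PosTag> tags, LexicalCategory category) {
        return tags.stream().filter(t -> t.category == category).count();
    }

    public static void main(String[] args) {
        List<PosTag> mixed = Arrays.asList(
                PENN.getTag("NN"), PENN.getTag("VBZ"),
                STTS.getTag("NE"), STTS.getTag("VVFIN"),
                new PosTag("XY")); // unmapped tag without a category
        System.out.println(count(mixed, LexicalCategory.Noun));
        System.out.println(count(mixed, LexicalCategory.Verb));
    }
}
```

      A tag without a category is simply skipped by the category filter, so
      tags missing from a TagSet do not break language-independent code.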


      A code sample:

      TagSet<PosTag> tagSet; //the used TagSet
      Map<String,PosTag> unknown; //missing tags in the TagSet

      Token token; //the token
      String tag; //the detected tag
      double prob; //the probability

      PosTag pos = tagSet.getTag(tag);
      if(pos == null) { //unknown tag
          pos = unknown.get(tag);
      }
      if(pos == null) {
          pos = new PosTag(tag); //this tag will not have a LexicalCategory
          unknown.put(tag, pos); //only one instance
      }
      //attach the tag and its probability to the token; addAnnotation is
      //assumed here as the counterpart of getAnnotation
      token.addAnnotation(NlpAnnotations.posAnnotation,
          new Value<PosTag>(pos, prob));




            • Assignee:
              rwesten Rupert Westenthaler


              • Created: