Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: enhancer-0.10.0
    • Component/s: Enhancer
    • Labels:
      None

      Description

      Because the management of NLP metadata - usually available at word granularity - is not feasible using the RDF metadata, this issue describes the addition of a special ContentPart to Stanbol. This ContentPart will have the name AnalysedText.

      AnalysedText
      =====

      • It wraps the text/plain ContentPart of a ContentItem
      • It allows the definition of Spans (type, start, end, spanText). Type
        is an Enum: Text, TextSection, Sentence, Chunk, Token
      • Spans are sorted naturally by type, start and end. This allows the
        use of a NavigableSet (e.g. TreeSet) and the #subSet() functionality
        to work with contained Tokens. The #higher and #lower methods of
        NavigableSet even allow building Iterators that support concurrent
        modifications (e.g. adding Chunks while iterating over the Tokens of
        a Sentence).
      • One can attach Annotations to Spans. Basically a multi-valued Map
        with Object keys and Value<valueType> value(s) that supports a type
        safe view by using generically typed Annotation<key,valueType>
      • The Value<valueType> object natively supports confidence. This
        allows (e.g. for POS tags) the same instance (e.g. of the POS tag
        for Noun) to be used for all noun annotations.
      • Note that the AnalysedText does NOT use RDF for representing this
        kind of data, as RDF is not scalable enough. This also means that the
        data of the AnalysedText are NOT available in the Enhancement Metadata
        of the ContentItem. However, EnhancementEngines are free to write
        all/some results to the AnalysedText AND the RDF metadata of the
        ContentItem.
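      The ordering and #subSet() behaviour described above can be sketched with plain Java collections. Note that the Span class below is a simplified, hypothetical stand-in for the actual Stanbol interfaces:

      ```java
      import java.util.NavigableSet;
      import java.util.TreeSet;

      public class SpanOrderingSketch {

          //the span types in their natural order
          enum SpanType { Text, TextSection, Sentence, Chunk, Token }

          //simplified stand-in for a Stanbol Span: compared by type, start, end
          static class Span implements Comparable<Span> {
              final SpanType type;
              final int start;
              final int end;
              Span(SpanType type, int start, int end) {
                  this.type = type; this.start = start; this.end = end;
              }
              public int compareTo(Span o) {
                  int c = type.compareTo(o.type);
                  if (c == 0) c = Integer.compare(start, o.start);
                  if (c == 0) c = Integer.compare(end, o.end);
                  return c;
              }
          }

          //count the Tokens contained in the given character range via #subSet()
          static int tokensBetween(NavigableSet<Span> spans, int start, int end) {
              return spans.subSet(
                  new Span(SpanType.Token, start, start), true,
                  new Span(SpanType.Token, end, Integer.MAX_VALUE), true).size();
          }

          public static void main(String[] args) {
              NavigableSet<Span> spans = new TreeSet<Span>();
              spans.add(new Span(SpanType.Sentence, 0, 20));
              spans.add(new Span(SpanType.Token, 0, 5));
              spans.add(new Span(SpanType.Token, 6, 11));
              spans.add(new Span(SpanType.Token, 21, 25)); //in a later sentence
              //only the two Tokens inside the first sentence are counted
              System.out.println(tokensBetween(spans, 0, 20)); //prints 2
          }
      }
      ```

      Because the natural order groups Spans by type first, a range query over the Token type skips all Sentence and Chunk entries without any filtering.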

      Here is some sample code:

      AnalysedText at; //the ContentPart
      Iterator<Sentence> sentences = at.getSentences();
      while(sentences.hasNext()){
          Sentence sentence = sentences.next();
          String sentText = sentence.getSpan();
          Iterator<Token> tokens = sentence.getTokens();
          while(tokens.hasNext()){
              Token token = tokens.next();
              String tokenText = token.getSpan();
              Value<PosTag> pos = token.getAnnotation(
                  NlpAnnotations.POS_ANNOTATION);
              String tag = pos.value().getTag();
              double confidence = pos.probability();
          }
      }

      NLP annotations
      =====

      • TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
        contains Tags of a specific generic type. The Tag only defines a
        String "tag" property
      • Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
        defined. Both also define an optional LexicalCategory. This is an
        enum with the 12 top level concepts defined by the
        [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
        Adjective, Adposition, Adverb ...)
      • TagSets (including mapped LexicalCategories) are defined for all
        languages where POS taggers are available for OpenNLP. This also
        includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" models
        provided by OLiA. The other TagSets used by OpenNLP are currently
        not available from OLiA.
      • Note that the LexicalCategory can be used to process POS annotations
        of different languages

      TagSet:
      https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
      POS:
      https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos

      A code sample:

      TagSet<PosTag> tagSet; //the used TagSet
      Map<String,PosTag> unknown; //tags missing in the TagSet

      Token token; //the token
      String tag; //the detected tag
      double prob; //the probability

      PosTag pos = tagSet.getTag(tag);
      if(pos == null){ //unknown tag
          pos = unknown.get(tag);
      }
      if(pos == null){
          pos = new PosTag(tag); //this tag will not have a LexicalCategory
          unknown.put(tag, pos); //only one instance
      }
      token.addAnnotation(
          NlpAnnotations.POS_ANNOTATION,
          new Value<PosTag>(pos, prob));

        Activity

        rwesten Rupert Westenthaler added a comment -

        Considered to be implemented with http://svn.apache.org/viewvc?rev=1412121&view=rev. Further changes/adaptions should be implemented in their own (more focused) issues.

        rwesten Rupert Westenthaler added a comment -

        Documentation for the NLP Annotations

        NLP Annotations
        ===========

        While the [Analyzed Text](analyzedtext) interface allows the definition of Sentences, Chunks and Tokens within the text and the attachment of annotations to those, this part of the Stanbol NLP processing module defines the Java domain model used for those annotations. This includes annotation models for Part of Speech (POS) tags, Chunks, recognized Named Entities (NER) as well as morphological analysis.

            1. Part of Speech (POS) annotations

        Part of Speech (POS) tagging represents a token level annotation. It assigns tokens to categories like noun, verb, adjective, punctuation ... These annotations are typically provided by a POS tagger that consumes Tokens and provides tag(s) with confidence(s) as output. Tags are usually string values that are members of a TagSet - a fixed list of tags used to annotate tokens. Those tag sets are typically language specific and often even specific to the training corpus. This makes it really hard to consume POS tags created by different POS taggers for different languages, as the consumer would need to know the meanings of all the different POS tags for the different languages.

        The POS annotation model defined by the Stanbol NLP module tries to solve this issue by providing means to align POS tag sets with formal categories defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The following sub-sections provide details and usage examples.

              1.1. OLiA MorphosyntacticCategories

        The '[OLiA](http://nlp2rdf.lod2.eu/olia/) Reference Model for Morphology and Morphosyntax, with experimental extension to Syntax' defines a set of ~150 formally defined and multi-lingual POS tags. Those types are defined as a non-cyclic multi-hierarchy with 'olia:MorphosyntacticCategory' as common root.

        To give an example, the POS tag 'olia:Gerund' is defined as an 'olia:NonFiniteVerb', which itself is an 'olia:Verb'. An example of a multi-hierarchy is 'olia:NominalQuantifier', which is both an 'olia:Noun' and an 'olia:Quantifier'.

        To support a clean integration of the formal definitions of the OLiA ontology within the Stanbol NLP annotations there are two Java enumerations:

        • _LexicalCategory_: This enumeration covers the 12 top level categories as defined by OLiA. This includes Noun, Verb, Adjective, Adposition, Adverb, Conjunction, Interjection, PronounOrDeterminer, Punctuation, Quantifier, Residual and Unique.
        • _Pos_: This enumeration covers all OLiA MorphosyntacticCategories from the 2nd level downwards. By using the Pos enum one can e.g. distinguish between ProperNouns and CommonNouns or FiniteVerbs and NonFiniteVerbs ... The Pos enumeration has full support for the multi-hierarchy as defined by OLiA. The Pos#categories() method allows access to the 1st level parents of a Pos. Pos#hierarchy() returns all 2nd+ level parents of a Pos member.
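        The multi-hierarchy design sketched above can be illustrated with a plain Java enum that carries references to its parent categories. The names below are simplified, hypothetical stand-ins, not the actual Stanbol enumerations:

        ```java
        import java.util.EnumSet;

        public class PosHierarchySketch {

            //stand-in for LexicalCategory: top level OLiA categories
            enum Category { Noun, Verb, Quantifier }

            //stand-in for the Pos enum: each member references its 1st level
            //parents, so members like NominalQuantifier can have several
            enum Pos {
                ProperNoun(Category.Noun),
                NonFiniteVerb(Category.Verb),
                NominalQuantifier(Category.Noun, Category.Quantifier);

                private final EnumSet<Category> categories;

                Pos(Category first, Category... rest) {
                    this.categories = EnumSet.of(first, rest);
                }

                //similar in spirit to the Pos#categories() method described above
                EnumSet<Category> categories() {
                    return categories;
                }
            }

            public static void main(String[] args) {
                //olia:NominalQuantifier is both a Noun and a Quantifier
                System.out.println(Pos.NominalQuantifier.categories());
            }
        }
        ```

        Because the parent links are plain enum members, a consumer can process POS annotations of different languages by switching on the top level category instead of the language specific tag.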
              1.2. PosTag and TagSet

        The PosTag represents a POS tag as used by a POS tagger. PosTags support the following features:

        • _tag_ [1..1]::String - This is the string tag as used by the POS tagger.
        • _category_ [0..*]::LexicalCategory - The assigned LexicalCategory enumeration members.
        • _pos_ [0..*]::Pos - The assigned Pos enumeration members.

        An example of a PosTag representing an 'olia:ProperNoun' looks as follows:

        :::java
        PosTag tag = new PosTag("NP", Pos.ProperNoun);

        The first parameter is the string POS tag used by the POS tagger and the second parameter represents the mapping to the OLiA MorphosyntacticCategories for this tag. The next example shows a more sophisticated mapping for the "PWAV" (Pronominaladverb) tag as used by the STTS tag set for the German language:

        :::java
        new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun, Pos.InterrogativePronoun);

        TagSet is the other important class, as it allows the management of the set of PosTag instances. TagSet has two main functions: First, it allows an integrator of a POS tagger with Stanbol to define the mappings from the string POS tags used by the POS tagger to the LexicalCategory and Pos enumeration members as preferably used by the Stanbol NLP chain. Second, it ensures that only a single PosTag instance is used to annotate all Tokens with the same type.

        _TagSet_s are typically specified as static members of utility classes. The following code snippet shows an example:

        :::java
        //TagSet is generically typed. We need a TagSet for PosTags
        public static final TagSet<PosTag> STTS = new TagSet<PosTag>(
            "STTS", "de"); //define a name and the languages it supports

        static {
            //you can set properties on a TagSet. While supported, this
            //feature is currently not used by Stanbol
            STTS.getProperties().put("olia.annotationModel",
                new UriRef("http://purl.org/olia/stts.owl"));
            STTS.getProperties().put("olia.linkingModel",
                new UriRef("http://purl.org/olia/stts-link.rdf"));
            STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective));
            STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective));
            STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb));
            //[...]
        }

        The string tag (first parameter) of the PosTag is used as unique key by the TagSet. Adding a 2nd PosTag with the same tag will override the first one. PosTags that are added to a TagSet have the Tag#getAnnotationModel() property set to that model.

        The final example shows a code snippet with the core part of a POS tagging engine using both the [AnalyzedText](analyzedtext) and the PosTag and TagSet APIs:

        :::java
        TagSet<PosTag> tagSet; //the used TagSet
        //holds PosTags for tags returned by the POS tagger that
        //are missing in the TagSet
        Map<String,PosTag> adhocTags = new HashMap<String,PosTag>();
        List<Span> tokenList = new ArrayList<Span>(64);

        Iterator<Section> sentences; //Iterator over the sentences

        while(sentences.hasNext()){
            Section sentence = sentences.next();
            //get the tokens of the current sentence
            tokenList.clear();
            AnalysedTextUtils.appendToList(
                sentence.getEnclosed(SpanTypeEnum.Token),
                tokenList);
            //typically one also needs the Strings
            //of the tokens for the POS tagger
            String[] tokenText = new String[tokenList.size()];
            for(int i=0;i<tokenList.size();i++){
                tokenText[i] = tokenList.get(i).getSpan();
            }
            //now POS tag the sentence
            String[] posTags = posTagger.tag(tokenText);

            //finally apply the PosTags and save the annotations
            for(int i=0;i<tokenList.size();i++){
                PosTag tag = tagSet.getTag(posTags[i]);
                if(tag == null){ //unmapped tag
                    tag = adhocTags.get(posTags[i]);
                }
                if(tag == null){ //unknown tag
                    tag = new PosTag(posTags[i]);
                    adhocTags.put(posTags[i], tag);
                }
                //add the annotation to the Token
                tokenList.get(i).addAnnotation(
                    NlpAnnotations.POS_ANNOTATION,
                    Value.value(tag));
            }
        }

            2. Phrase annotations

        Phrase annotations can be used to define the type of a Chunk. The PhraseTag class is used for phrase annotations. It defines first a string tag and second the phrase category. The LexicalCategory enumeration is used as value for the category. As PhraseTag is a subclass of Tag, it can also be used in combination with the TagSet class as described in the [PosTag and TagSet] section.

        The following code snippet shows how to create a PhraseTag for noun phrases:

        :::java
        PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun);

            3. Named Entity (NER) annotations

        Named Entity annotations are created by NER modules. Before the Stanbol NLP chain they were represented in Stanbol by using '[fise:TextAnnotation](../enhancementstructure#fisetextannotation)'s, and any Enhancement Engine that does NER should still support this. With the Stanbol NLP processing module it is now also possible to represent detected Named Entities as a Chunk with a NerTag added as annotation.

        A Named Entity represented as 'fise:TextAnnotation' includes the following information:

        urn:namedEntity:1
            rdf:type fise:TextAnnotation, fise:Enhancement
            fise:selected-text {named-entity-text}
            fise:start {start-char-pos}
            fise:end {end-char-pos}
            dc:type {named-entity-type}

        where:

        • {named-entity-text}

          is the text recognized as Named Entity. This is the same as returned by Chunk#getSpan()

        • {start-char-pos}

          is the start character position of the Named Entity relative to the start of the text. This is the same as Chunk#getStart()

        • {end-char-pos}

          is the end position and the same as Chunk#getEnd()

        • {named-entity-type}

          is the type of the recognized Named Entity as URI. The NerTag allows the definition of both the string tag as used by the NER component as well as the URI this type is mapped to. In Stanbol it is preferred to use 'dbpedia:Person', 'dbpedia:Organisation' and 'dbpedia:Place' for the according entity types.

        The NerTag class extends Tag and can therefore also be used with the TagSet class. This means that users of the API can use a TagSet to manage the string tag to URI mappings for the supported Named Entity types.

        The following code snippet shows how to add NER annotations to the AnalysedText:

        :::java
        AnalysedText at; //The AnalysedText
        TagSet<NerTag> nerTags; //registered NER tags
        Iterator<Section> sections; //sections to iterate over

        List<String> tokenTexts = new ArrayList<String>(64);

        while(sections.hasNext()){
            Section section = sections.next();
            //NER taggers typically need String[] as input
            tokenTexts.clear();
            Iterator<Token> tokens = section.getTokens();
            while(tokens.hasNext()){
                tokenTexts.add(tokens.next().getSpan());
            }
            //Span -> #start #end #type #probability
            Span[] nerSpans = nerTagger.tag(
                tokenTexts.toArray(new String[tokenTexts.size()]));
            for(int i=0; i < nerSpans.length; i++){
                Chunk namedEntity = at.addChunk(
                    nerSpans[i].start, nerSpans[i].end);
                NerTag tag = nerTags.getTag(nerSpans[i].type);
                if(tag == null){ //unmapped NER type
                    tag = new NerTag(nerSpans[i].type);
                }
                namedEntity.addAnnotation(
                    NlpAnnotations.NER_ANNOTATION,
                    Value.value(tag, nerSpans[i].probability));
            }
        }

        Note that the above code snippet only shows how to add the Named Entity to the AnalyzedText ContentPart. An actual NER engine implementation also needs to add this information to the metadata of the [ContentItem](../contentitem).

        :::java
        ContentItem ci; //The processed ContentItem
        Language language; //The Language of the processed Text
        MGraph metadata = ci.getMetadata();
        Section section; //the current Section
        Chunk namedEntity; //the currently processed Named Entity

        Value<NerTag> nerAnnotation = namedEntity.getAnnotation(
            NlpAnnotations.NER_ANNOTATION);

        UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT,
            new PlainLiteralImpl(namedEntity.getSpan(), language)));
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT,
            new PlainLiteralImpl(section.getSpan(), language)));
        if(nerAnnotation.value().getType() != null){
            metadata.add(new TripleImpl(textAnnotation, DC_TYPE,
                nerAnnotation.value().getType()));
        } //else do not add a dc:type for unmapped Named Entities
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE,
            literalFactory.createTypedLiteral(nerAnnotation.probability())));
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_START,
            literalFactory.createTypedLiteral(namedEntity.getStart())));
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_END,
            literalFactory.createTypedLiteral(namedEntity.getEnd())));

            4. Morphological Analyses

        _NOTE:_ This part of the Stanbol NLP annotations is still work in progress, so this part of the API might undergo heavy changes even in minor releases.

        The results of a morphological analysis are represented by the MorphoFeatures class and can be added to the analyzed word (Token) by using the NlpAnnotations.MORPHO_ANNOTATION. The MorphoFeatures class provides the following features:

        • _Lemma_: A String value representing the lemmatization of the annotated Token.
        • _Case_: The Case enumeration contains around 70 members defined based on concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The CaseTag allows the definition of cases and optionally maps them to the cases defined by the enumeration.
        • _Definitness_: The Definitness enumeration has the members Definite and Indefinite, also defined by concepts in the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
        • _Gender_: The Gender enumeration contains the six genders defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The GenderTag allows the definition of genders and optionally maps them to the genders defined by the enumeration.
        • _Number_: The NumberFeature enumeration defines the eight number features defined by [OLiA](http://nlp2rdf.lod2.eu/olia/). The NumberTag can be used to define number features and map them to the members of the enumeration.
        • _Person_: The Person enumeration has the definitions for 'first', 'second' and 'third' with mappings to the corresponding concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
        • _Tense_: The Tense enumeration represents the tense hierarchy as defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). Tense#getParent() allows access to the direct parent of a Tense, while the Tense#getTenses() method can be used to obtain the transitive closure (including the Tense object itself). TenseTag is used for tense annotations. It allows both the parsing of a string tag representing the tense as well as the definition of a mapping to the tenses defined by the Tense enumeration.
        • _Mood_: The VerbMood enumeration currently defines members from different parts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). While OLiA does define the 'olia:MoodFeature' class, its members were not a good match for the verb moods as used by the CELI/linguagrid.org service. For now the decision was to define the VerbMood enumeration more closely to the usage of CELI, but this clearly needs to be validated as soon as implementations for other NLP frameworks are added. There is also a VerbMoodTag that allows the definition of verb moods by a string tag and a mapping to the VerbMood enumeration.

        MorphoFeatures supports multi-valued annotations for all the above features. Getters for a single value will always return the first added value.
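        The "first added value wins" semantics of the single-value getters can be sketched with plain Java. This is a simplified, hypothetical model for illustration, not the actual MorphoFeatures class:

        ```java
        import java.util.ArrayList;
        import java.util.LinkedHashMap;
        import java.util.List;
        import java.util.Map;

        public class MultiValuedFeatureSketch {
            //values per feature keep insertion order, so the single-value
            //getter can simply return the first added value
            private final Map<String, List<String>> features =
                new LinkedHashMap<String, List<String>>();

            public void add(String feature, String value) {
                List<String> values = features.get(feature);
                if (values == null) {
                    values = new ArrayList<String>();
                    features.put(feature, values);
                }
                values.add(value);
            }

            //single-value getter: first added value, or null if absent
            public String get(String feature) {
                List<String> values = features.get(feature);
                return values == null || values.isEmpty() ? null : values.get(0);
            }

            //multi-value getter: all values in insertion order
            public List<String> getAll(String feature) {
                List<String> values = features.get(feature);
                return values == null ? new ArrayList<String>() : values;
            }

            public static void main(String[] args) {
                MultiValuedFeatureSketch morpho = new MultiValuedFeatureSketch();
                morpho.add("lemma", "go");
                morpho.add("lemma", "going"); //alternative analysis
                System.out.println(morpho.get("lemma")); //prints "go"
            }
        }
        ```

        Keeping insertion order makes the single-value view deterministic: the analyzer's preferred (first reported) analysis is what simple consumers see, while consumers that care about alternatives can still retrieve all values.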

        Show
        rwesten Rupert Westenthaler added a comment - Documentation for the NLP Annotations NLP Annotations =========== While the The [Analyzed Text] (analyzedtext) interface allows to define Sentences, Chunks and Tokens within the text and also to attach annotations to those this part of the Stanbol NLP processing module provides the Java domain model for the annotations section this part of the Stanbol NLP processing module defines the Java domain model used for those annotations. This includes annotation models for Part of Speech (POS) tags, Chunks , recognized Named Entities (NER) as well as morphological analysis. Part of Speech (POS) annotations Part of Speech (POS) tagging represents an token level annotation. It assigns tokens with categories like noun, verb, adjectives, punctuation ... This annotations are typically provided by an POS tagger that consumes Tokens and provides tag(s) with confidence(s) as output. Tags are usually string values that are member of a TagSet - a fixed list of tags used to annotate tokens. Those Tag sets are typically language and often even trainings corpus specific. This makes it really hard to consume POS tags created by different POS tagger for different languages as the consumer would need to know about the meanings of all the different POS tags for the different languages. The POS annotation model defined by the Stanbol NLP module tries to solve this issue by providing means to align POS tag sets with formal categories defined by the [OLiA Ontology] ( http://nlp2rdf.lod2.eu/olia/ ). The following sub-section will provide details and usage examples. OLiA MorphosyntacticCategories The ' [OLiA] ( http://nlp2rdf.lod2.eu/olia/ ) Reference Model for Morphology and Morphosyntax, with experimental extension to Syntax' defines a set of ~150 formally defined and multi-lingual POS tags. Those types are defined as a non-cyclic multi-hierarchy with 'oilia:MorphosyntacticCategory' as common root. 
To give an example the POS 'olia:Gerund' is defined as a 'olia:NonFiniteVerb' what itself is a 'olia:Verb'. An example for a multi-hierarchy is 'olia:NominalQuantifier' that is both a 'olia:Noun' and a 'olia:Quantifier'. To allow support a nice integration of the formal definitions by the OLiA ontology within the Stanbol NLP annotations there are two Java enumerations: _ LexicalCategories _: This enumeration covers the 12 top level categories as defined by OLiA. This includes Noun, Verb, Adjective, Adposition, Adverb, Conjuction, Interjection, PronounOrDeterminer, Punctuation, Quantifier, Residual and Unique. _ Pos : This enumeration covers all OLiA MorphosyntacticCategories from the 2+ level. So by using the _Pos enum one can e.g. distinguish between ProperNoun's and CommonNoun's or FiniteVerb's and NonFiniteVerb's ... The Pos enumeration has full support for the multi-hierarchy as defined by OLiA. The Pos#categories() methods allows to get the 1st level parents of Pos . The Pos#hierarchy() returns all 2+ level parents of a Pos member. PosTag and TagSet The PosTag represents a POS tag as used by an POS tagger. PosTags do support the following features: _ tag _ [1..1] ::Stirng - This is the string tag as used by the POS tagger. _ category _ [0..*] ::LexicalCategory - The assigned LexicalCategory enumeration members. _ pos _ [0..*] ::Pos - The assigned Pos enumeration members. An Example for a PosTag representing a 'olia:ProperNoun' looks like follows :::java PosTag tag = new PosTag("NP", Pos.ProperNoun); The first parameter is the String POS tag used by the POS tagger and the second parameter represents the mapping to the OLiA MorphosyntacticCategories for this tag. 
The next example shows an sofisticated mapping for the "PWAV" (Pronominaladverb) as used by the STTS tag set for the German language :::java new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun, Pos.InterrogativePronoun); TagSet is the other important class as it allows to manage the set of PosTag instances. TagSet has two main functions: First it allows an integrator of an POS tagger with Stanbol to define the mappings from the string POS tags used by the Pos Tagger to the LexicalCategory and Pos enumeration members as preferable used by the Stanbol NLP chain. Second it ensures that there is only a single instance of PosTag used to annotate all Tokens with the same type. _TagSet_s are typically specified as static members of utility classes. The following code snippet shows an example :::java //Tagset is generically typed. We need a TagSet for PosTag's public static final TagSet<PosTag> STTS = new TagSet<PosTag>( "STTS", "de"); //define a name and the languages it supports static { //you can set properties to a TagSet. While supported this //feature is currently not used by Stanbol STTS.getProperties().put("olia.annotationModel", new UriRef("http://purl.org/olia/stts.owl")); STTS.getProperties().put("olia.linkingModel", new UriRef("http://purl.org/olia/stts-link.rdf")); STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective)); STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective)); STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb)); //[...] } The string tag (first parameter) of the PosTag is used as unique key by the TagSet . Adding an 2nd PasTag with the same tag will override the first one. PosTag_s that are added to a _TagSet have the Tag#getAnnotationModel() property set to that model. The final example shows a code snippet shows the core part of an POS tagging engine using the both the [AnalyzedText] (analyzedtext) and the PosTag and TagSet APIs. 
:::java TagSet<PosTag> tagSet; //the used TagSet //holds PosTags for tags returned by the POS tagger that //are missing in the TagSet Map<String,PosTag> adhocTags = new HashMap<String,PosTag>(): List<Span> token = new ArrayList<Span>(64); Iterator<Section> sentences; //Iterator over the sentences while(sentences.hasNext()){ Section sentence = sentences.next(); //get the tokens of the current sentence token.clean(); AnalysedTextUtils.appandToList( sentence.getEnclosed(SpanTypeEnum.Token), tokenList); //typically one needs also to get the Strings //of the tokens for the pos tagger String[] tokenText = new String [tokenList.size()] ; for(int i=0;i<tokens.size();i++) { tokenText[i] = tokens.get(i).getSpan(); } //now POS tag the sentence String[] posTags = posTagger.tag(tokens); //finally apply the PosTags and save the annotation for(int i=0;i<tokens.size();i++){ PosTag tag = tagSet.get(posTags [i] ); if(tag == null) { //unmapped tag tag = adhocTags.get(posTags[i]); } if(tag == null) { //unknown tag tag = new PosTag(posTags[i]); adhocTags.put(posTags[i],tag); } //add the annotation to the Token token.addAnnotation( NlpAnnotations.POS_ANNOTATION, Value.value(tag)); } } Phrase annotations Phrase annotations can be used to define the type of a Chunk . The PhraseTag class is used for phrase annotations. It defines first a string tag and secondly the Phrase category. The LexicalCategory enumeration is used as valued for the category. As the PhraseTag is a subclass of Tag it can be also used in combination with the TagSet class as described in the [PosTag and TagSet] section. The following code snippets show how to create a PhraseTag for noun phrases :::java PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun); Name Entity (NER) annotations Named Entity annotations are created by NER modules. 
Before the Stanbol NLP chain they where represented in Stanbol by using ' [fise:TextAnnotation] (../enhancementstructure#fisetextannotation)'s and any Enhancement Engine that does NER should still support this. With the Stanbol NLP processing module it is now also possible to represent detected Named Entities as Chunk with an PhraseTag added as Annotation. A Named Entity represented as 'fise:TextAnnotation' includes the following information: urn:namedEntity:1 rdf:type fise:TextAnnotation, fise:Enhancement fise:selected-text {named-entity-text} fise:start {start-char-pos} fise:end {end-char-pos} dc:type {named-entity-type} where: * {named-entity-text} is the text recognized as Named Entity. This is the same as returned by Chunk#getSpan() {start-char-pos} is the start character position of the Named Entity relative to the start of the text. This is the same as Chunk#getStart() {end-char-pos} is the end position and the same as Chunk#getEnd() {named-enttiy-type} is the type of the recognized Named Entity as URI. The _PhraseTag allows to define both the string tag as used by the NER component as well as the URI this type is mapped to. In Stanbol it is preferred to use 'dbpedia:Person', 'dbpedia:Organisation' and 'dbpedia:Place' for the according entity types. The NerTag class extends Tag and can therefore be also used with the TagSet class. This means that users of the API can use TagSet to manage the string tag to URI mappings for the supported Named Entity types. 
The following Code Snippets shows how to add NER annotations to the AnalysedText: :::java AnalysedText at; //The AnalysedText TagSet<NerTag> nerTags; //registered NER tags Iterator<Section> sections; //sections to iterate over List<String> tokenTexts = new ArrayList<Span>(64); while(sections.hasNext()){ Section section = sections.next(); //NER tagger typically need String[] as input token.clean(); Iterator<Token> tokens = section.getTokens; while(tokens.hasNext()) { tokenTexts.add(tokens.next().getSpan()); } //Span -> #start #end #type #probability Span[] nerSpans = nerTagger.tag( tokenTexts.toArray(new String [tokenTexts.size()] ); for(int i=0; i < nerSpans.length; i++){ Chunk namedEntity = at.addChunk( nerSpans [i] .start,nerSpans [i] .start); NerTag tag = nerTags.get(nerSpans [i] .type) if(tag == null) { //unmapped NER tag = new NerTag(nerSpans[i].type); } namedEntity.addAnnotation( NlpAnnotations.NER_ANNOTATION, Value.value(tag, nerSpans [i] . probability)); } } Note that the above Code Snippet only shows how to add the Named Entity to the AnalyzedText ContentPart. A actual NER engine Implementation needs also to add those information to the metadata of the [ContentItem] (../contentitem). 
:::java ContentItem ci; //The processed ContentItem Language lang; //The Language of the processed Text MGraph metadata = ci.getMetadata(); Section section; //the current Section Chunk namedEntity //the currently processed Named Entity Value<NerTag> nerAnnotation = namedEntity.getAnnotation( NlpAnnotations.NER_ANNOTATION); UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this); metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT, new PlainLiteralImpl(namedEntity.getSpan(), language))); metadata.add.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT, new PlainLiteralImpl(section.getSpan(), language))); if(tag.getType() != null) { metadata.add(new TripleImpl(textAnnotation, DC_TYPE, nerAnnotation.value().getType)); } //else do not add an dc:type for unmapped NamedEntities g.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE, literalFactory.createTypedLiteral(nerAnnotation.probability()))); g.add(new TripleImpl(textAnnotation, ENHANCER_START, literalFactory.createTypedLiteral(namedEntity.getStart())); g.add(new TripleImpl(textAnnotation, ENHANCER_END, literalFactory.createTypedLiteral(namedEntity.getEnd()))); Morphological Analyses _ NOTE: _ This part of the Stanbol NLP annotations is still work in progress. So this part of the API might undergo heavy changes even in minor releases. The results of a Morphological Analyses are represented by the MorphoFeatures class and can be added to the analyzed word ( Token ) by using the NlpAnnotations.MORPHO_ANNOTATION . The MorphoFeatures class provides the following features: _ Lemma _: A String value representing the lemmatization of the annotated Token. _ Case : The _Case enumeration contains around 70 members defined based on concepts of the [OLiA Ontology] ( http://nlp2rdf.lod2.eu/olia/ ). The CaseTag allows to define cases and optionally map them to the cases defined by the enumeration. 
• *Definitness*: The Definitness enumeration has the members Definite and Indefinite, also defined by concepts in the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
• *Gender*: The Gender enumeration contains the six genders defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The GenderTag allows to define genders and optionally map them to the genders defined by the enumeration.
• *Number*: The NumberFeature enumeration defines the eight number features defined by [OLiA](http://nlp2rdf.lod2.eu/olia/). The NumberTag can be used to define number features and map them to the members of the enumeration.
• *Person*: The Person enumeration has the definitions for 'first', 'second' and 'third' with mappings to the according concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
• *Tense*: The Tense enumeration represents the tense hierarchy as defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). Tense#getParent() allows access to the direct parent of a Tense while the Tense#getTenses() method can be used to obtain the transitive closure (including the Tense object itself). TenseTag is used for Tense annotations. It allows both to parse a string tag representing the tense as well as defining a mapping to the tenses defined by the Tense enumeration.
• *Mood*: The VerbMood enumeration currently defines members from different parts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). While OLiA does define the 'olia:MoodFeature' class, its members did not match well with the verb moods as used by the CELI/linguagrid.org service. For now the decision was to define the VerbMood enumeration more closely to the usage of CELI, but this clearly needs to be validated as soon as implementations for other NLP frameworks are added. There is also a VerbMoodTag that allows to define verb moods by a string tag and a mapping to the VerbMood enumeration.

The MorphoFeatures class supports multi-valued annotations for all the above features. Getters for a single value will always return the first added value.
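
The multi-valued feature storage with a first-added single-value getter can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual MorphoFeatures class; the class name MorphoSketch and its members are hypothetical, and only the Lemma feature is shown.

```java
import java.util.ArrayList;
import java.util.List;

public class MorphoSketch {
    // Minimal stand-in for MorphoFeatures: every feature is multi-valued,
    // and the single-value getter returns the first added value
    private final List<String> lemmas = new ArrayList<>();

    public void addLemma(String lemma) {
        lemmas.add(lemma); //insertion order is preserved
    }

    public String getLemma() { //single-value getter: first added value
        return lemmas.isEmpty() ? null : lemmas.get(0);
    }

    public List<String> getLemmas() { //multi-value getter
        return lemmas;
    }

    public static void main(String[] args) {
        MorphoSketch mf = new MorphoSketch();
        mf.addLemma("run");  //added first
        mf.addLemma("runs"); //alternative analysis
        System.out.println(mf.getLemma()); // run
    }
}
```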
        rwesten Rupert Westenthaler added a comment -

        Documentation for the in-memory implementation of the AnalyzedText

        In-Memory AnalyzedText and Annotation implementation
        ================

        This describes the implementation of the [Analyzed Text](analysedtext) used by default by the Stanbol NLP processing module. This implementation is directly contained within the org.apache.stanbol.enhancer.nlp module.

          1. AnalyzedTextFactory

        The AnalyzedTextFactory of the in-memory implementation registers itself as an OSGi service with a "service.ranking" of Integer.MIN_VALUE. That means that any other registered AnalyzedTextFactory will override this one (unless it also registers with Integer.MIN_VALUE).
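
The effect of the minimal service ranking can be illustrated with a small sketch. The Registration record and select() helper below are hypothetical stand-ins, not OSGi or Stanbol API: they only model the rule that the highest service.ranking wins, so a factory registered with Integer.MIN_VALUE loses to any other registration (real OSGi additionally breaks ties by lowest service.id, which is omitted here).

```java
import java.util.Comparator;
import java.util.List;

public class ServiceRankingDemo {
    // Hypothetical stand-in for an OSGi service registration
    record Registration(String name, int ranking) {}

    // OSGi-style selection: the highest service.ranking wins
    static Registration select(List<Registration> regs) {
        return regs.stream()
            .max(Comparator.comparingInt(Registration::ranking))
            .orElse(null);
    }

    public static void main(String[] args) {
        // the in-memory factory registers with Integer.MIN_VALUE, so any
        // other factory (default ranking 0) overrides it
        List<Registration> regs = List.of(
            new Registration("in-memory", Integer.MIN_VALUE),
            new Registration("custom", 0));
        System.out.println(select(regs).name()); // custom
    }
}
```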

        The implementation uses the ContentItemHelper#getText(Blob blob) method to retrieve the text from the parsed blob. The text is then used to create an AnalyzedText instance.

          1. AnalyzedText Implementation

        The in-memory implementation is based on a NavigableMap that uses the same span as both key and value. TreeMap is currently used as implementation. The compareTo(..) method of the Span implementation ensures the correct ordering of Spans as specified by the [Analyzed Text](analyzedtext) interface. All add*(..) methods first check if a span with the added type and [start,end) is already contained. If this is the case the existing span is returned, otherwise a new instance is created.

        The Iterator implementation is not based on the Iterators provided by the NavigableMap, as those would throw ConcurrentModificationExceptions - which is prohibited by the specification. Instead an implementation based on the #higherKey() method is used. Filtered Iterators are implemented using the Apache Commons Collections FilteredIterator utility with a Predicate based on the SpanTypeEnum.
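
How such a modification-tolerant iterator can be built on top of NavigableMap#higherKey() can be sketched as follows. TolerantKeyIterator is an illustrative name, not the actual Stanbol class: instead of holding a fail-fast map iterator, it remembers the last returned key and asks the map for the next higher key on each step, so entries added during iteration are simply picked up.

```java
import java.util.Iterator;
import java.util.List;
import java.util.ArrayList;
import java.util.NavigableMap;
import java.util.NoSuchElementException;
import java.util.TreeMap;

public class HigherKeyIteratorDemo {
    // Iterates a NavigableMap's keys via higherKey() instead of a
    // fail-fast iterator, tolerating concurrent modifications
    static class TolerantKeyIterator<K> implements Iterator<K> {
        private final NavigableMap<K, ?> map;
        private K last; //null before the first call to next()

        TolerantKeyIterator(NavigableMap<K, ?> map) { this.map = map; }

        public boolean hasNext() {
            return last == null ? map.firstEntry() != null
                                : map.higherKey(last) != null;
        }

        public K next() {
            K next = (last == null) ? map.firstKey() : map.higherKey(last);
            if (next == null) throw new NoSuchElementException();
            return last = next;
        }
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> spans = new TreeMap<>();
        spans.put(0, "Sentence");
        spans.put(10, "Token");
        Iterator<Integer> it = new TolerantKeyIterator<>(spans);
        List<Integer> seen = new ArrayList<>();
        while (it.hasNext()) {
            Integer key = it.next();
            seen.add(key);
            //add an entry while iterating: no ConcurrentModificationException
            if (key == 0) spans.put(5, "Chunk");
        }
        System.out.println(seen); // [0, 5, 10]
    }
}
```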

          1. Annotation Implementation

        The implementation of the Annotated interface is similar to that of the SolrInputDocument. Internally it uses a Map<Object,Object> to store data. When a single value is added it is directly stored in the map. In the case of multiple values the data are stored in arrays. Arrays are sorted by a comparator that ensures that the value with the highest probability is at index '0'.
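
The storage scheme can be sketched as follows. AnnotatedSketch and its Val record are hypothetical stand-ins for the actual Annotated and Value implementations: a single value is stored directly in the map, a second value promotes the entry to an array sorted so that the highest probability sits at index 0.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class AnnotatedSketch {
    // Hypothetical stand-in for Value: payload plus probability (-1 = unknown)
    record Val(Object value, double probability) {}

    private final Map<Object, Object> data = new HashMap<>();

    // Highest probability first; -1 (unknown) naturally sorts last
    private static final Comparator<Val> BY_PROB =
        Comparator.comparingDouble(Val::probability).reversed();

    public void add(Object key, Val v) {
        Object current = data.get(key);
        if (current == null) {
            data.put(key, v); //single value stored directly
        } else { //promote to (or grow) a sorted array
            Val[] old = current instanceof Val single
                ? new Val[]{single} : (Val[]) current;
            Val[] vals = Arrays.copyOf(old, old.length + 1);
            vals[old.length] = v;
            Arrays.sort(vals, BY_PROB); //highest probability at index 0
            data.put(key, vals);
        }
    }

    // Single-value getter: the value with the highest probability
    public Val get(Object key) {
        Object o = data.get(key);
        return o == null ? null : (o instanceof Val v ? v : ((Val[]) o)[0]);
    }

    public static void main(String[] args) {
        AnnotatedSketch annotated = new AnnotatedSketch();
        annotated.add("pos", new Val("N", 0.3));
        annotated.add("pos", new Val("V", 0.9));
        System.out.println(annotated.get("pos").value()); // V
    }
}
```

Note that, as in the described implementation, nothing prevents mixing value types under one key; a mismatched cast only surfaces at runtime.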

        Type safety is not checked so creating multiple Annotations with different value types that share the same key will cause ClassCastExceptions at runtime.

        rwesten Rupert Westenthaler added a comment -

        Documentation for Analyzed Text

        AnalysedText
        =====

        The AnalysedText is a Java domain model designed to describe NLP processing results. It consists of two major parts:

        1. Structure of the Text such as text-sections, sentences, chunks and tokens
        2. Annotations for the detected parts of the text.

          1. AnalysedText as ContentPart

        Within the Stanbol Enhancer the AnalysedText is used as [ContentPart](../contentitem#content-parts) registered with the URI <code>urn:stanbol.enhancer:nlp.analysedText</code>

        Because of that it can be retrieved by using the following code

        :::java
        AnalysedText at;
        ci.getLock().readLock().lock();
        try {
            at = ci.getPart(AnalysedText.ANALYSED_TEXT_URI, AnalysedText.class);
        } catch (NoSuchPartException e) {
            //not present
            at = null;
        } finally {
            ci.getLock().readLock().unlock();
        }

        Components that need to create an AnalysedText instance can do so by using the AnalysedTextFactory

        :::java
        @Reference
        AnalysedTextFactory atf;

        ContentItem ci; //the contentItem
        AnalysedText at;
        Entry<String,Blob> plainTextBlob = ContentItemHelper.getBlob(
            ci, Collections.singleton("text/plain"));
        if(plainTextBlob != null){
            //creates and adds the AnalysedText ContentPart to the ContentItem
            ci.getLock().writeLock().lock();
            try {
                at = atf.createAnalysedText(ci, plainTextBlob.getValue());
            } finally {
                ci.getLock().writeLock().unlock();
            }
        } else { //no NLP processing possible
            at = null;
        }

        If used outside of OSGi, users can also use AnalysedTextFactory#getDefaultInstance() to obtain the AnalysedTextFactory instance of the in-memory implementation.

          1. Structure of the Text

        The basic building block of the AnalysedText is the Span. A Span defines type, [start,end) as well as the spanText. For the type an enumeration (SpanTypeEnum) with the members Text, TextSection, Sentence, Chunk and Token is used. [start,end) defines the character positions of the Span within the Text, where the start position is inclusive and the end position is exclusive.
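
The [start,end) convention matches the semantics of String#substring(start, end), which a tiny example can confirm (the text and indexes below are illustrative):

```java
public class SpanIndexDemo {
    public static void main(String[] args) {
        String text = "Hello Stanbol";
        //a Span [6,13) over this text selects "Stanbol": the start index
        //is inclusive, the end index exclusive - exactly like substring
        int start = 6, end = 13;
        System.out.println(text.substring(start, end)); // Stanbol
    }
}
```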

        Analogous to the type of the Span there are also Java interfaces representing those types and providing additional convenience methods. An additional Section interface was introduced as common parent for all types that may have enclosed Spans. The AnalyzedText is the interface representing SpanTypeEnum#Text. The main intention of those Java interfaces is to provide convenience methods that ease the use of the API.

            1. Uniqueness of Spans

        A Span is considered equal to another Span if [start, end) and type are the same. The natural order of Spans is defined by:

        • smaller start index first
        • bigger end index first
        • higher ordinal number of the SpanTypeEnum first

        This order is used by all Iterators returned by the AnalyzedText API.
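
The three ordering rules above can be expressed as a plain Comparator. The Span record below is a simplified stand-in (type ordinal plus [start,end)), not the actual Stanbol interface; it only serves to show how the rules compose:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SpanOrderDemo {
    // Simplified stand-in for Span: SpanTypeEnum ordinal plus [start,end)
    record Span(int typeOrdinal, int start, int end) {}

    // Natural order: smaller start first, bigger end first,
    // higher SpanTypeEnum ordinal first
    static final Comparator<Span> NATURAL_ORDER =
        Comparator.comparingInt(Span::start)
            .thenComparing(Comparator.comparingInt(Span::end).reversed())
            .thenComparing(Comparator.comparingInt(Span::typeOrdinal).reversed());

    public static void main(String[] args) {
        //a sentence-like span [0,20) sorts before spans with the same start
        //but smaller end; for identical [start,end) the higher type ordinal
        //(the more specific span type) sorts first
        List<Span> spans = new ArrayList<>(List.of(
            new Span(4, 0, 10),   //token-like span
            new Span(2, 0, 20),   //sentence-like span
            new Span(3, 0, 10))); //chunk-like span
        spans.sort(NATURAL_ORDER);
        System.out.println(spans);
    }
}
```

This ordering is what makes enclosing spans precede their contained spans, so a NavigableSet#subSet() over it yields exactly the spans inside a Section.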

            1. Concurrent Modifications and Iterators

        Iterators returned by the AnalyzedText API MUST NOT throw _ConcurrentModificationException_s but rather reflect changes to the underlying model. While this is not consistent with the default behavior of Iterators in Java, it is central for the effective usage of the AnalyzedText API - e.g. when iterating over Sentences while adding Tokens.

            1. Code Samples:

        The following Code Snippet shows some typical usages of the API:

        :::java
        AnalysedText at; //typically retrieved from the contentPart
        Iterator<Sentence> sentences = at.getSentences();
        while(sentences.hasNext()){
            Sentence sentence = sentences.next();
            String sentText = sentence.getSpan();
            Iterator<Token> tokens = sentence.getTokens();
            while(tokens.hasNext()){
                Token token = tokens.next();
                String tokenText = token.getSpan();
                Value<PosTag> pos = token.getAnnotation(
                    NlpAnnotations.POS_ANNOTATION);
                String tag = pos.value().getTag();
                double confidence = pos.probability();
            }
        }

        Code that adds new Spans looks as follows:

        :::java
        //Tokenize a Text
        Iterator<Sentence> sentences = at.getSentences();
        Iterator<? extends Section> sections;
        if(sentences.hasNext()){ //sentence annotations present
            sections = sentences;
        } else { //if no sentences tokenize the text at once
            sections = Collections.singleton(at).iterator();
        }
        //Tokenize the sections
        while(sections.hasNext()){
            Section section = sections.next();
            //assuming the Tokenizer returns tokens as 2dim int array
            int[][] tokenSpans = tokenizer.tokenize(section.getSpan());
            for(int ti = 0; ti < tokenSpans.length; ti++){
                Token token = section.addToken(
                    tokenSpans[ti][0], tokenSpans[ti][1]);
            }
        }

        For all #add*(start,end) methods in the API the parsed start and end indexes are relative to the parent (the Span the #add*(..) method is called on). The [start,end) indexes returned by Spans are absolute values. If an #add*(..) method is called for a Span '[start,end):type' that already exists, the existing instance is returned instead of a new one.
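
The relative-in, absolute-out convention can be sketched with a minimal stand-in (the Section record and its addToken method below are illustrative, not the Stanbol API): a child is added with offsets relative to its parent's start, but reports absolute indexes.

```java
public class RelativeOffsetDemo {
    // Minimal stand-in for a Section: absolute [start,end) plus a child
    // factory whose arguments are relative to this section's start
    record Section(int start, int end) {
        Section addToken(int relStart, int relEnd) {
            //parsed indexes are relative to this parent; the returned
            //span reports absolute character positions
            return new Section(start + relStart, start + relEnd);
        }
    }

    public static void main(String[] args) {
        Section sentence = new Section(100, 140);
        Section token = sentence.addToken(0, 5);
        //the first five characters of the sentence, reported as
        //absolute indexes [100,105)
        System.out.println(token.start() + "," + token.end()); // 100,105
    }
}
```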

          1. Annotation Support

        Annotation support is provided by the two interfaces Annotated and Annotation and the Value class. Annotated provides an API for adding information to the annotated object. Those annotations are represented by key value mappings where Object is used as key and the Value class for values. The Value class provides the generically typed value as well as a double probability in the range [0..1], or -1 if not known. Finally the Annotation class is used to ensure type safety.

        The following example shows the intended usage of the API

        1. One needs to define the Annotations one would like to use. Annotations are typically defined as public static members of interfaces or classes. The following example uses the definition of the Part of Speech annotation.

        :::java
        public interface NlpAnnotations {
            //a Part of Speech Annotation using a String key
            //and the PosTag class as value
            Annotation<String,PosTag> POS_ANNOTATION =
                new Annotation<String,PosTag>(
                    "stanbol.enhancer.nlp.pos", PosTag.class);
            ...
        }

        2. Defined Annotations are used to add information to an Annotated instance (like a Span). For adding annotations the use of _Annotation_s is required to ensure type safety. The following code snippet shows how to add a PosTag with the probability 0.95.

        :::java
        PosTag tag = new PosTag("N"); //a simple POS tag
        Token token; //The Token we want to add the tag to
        token.addAnnotation(POS_ANNOTATION, Value.value(tag, 0.95));

        3. For consuming annotations there are two options: first, using the Annotation object, and second, directly using the key. While the 2nd option is not as nice to use (as it does not provide type safety) it allows consuming annotations without the need to have the used Annotation in the classpath. The following examples show both options:

        :::java
        Iterator<Token> tokens = sentence.getTokens();
        while(tokens.hasNext()){
            Token token = tokens.next();
            //(1) use the POS_ANNOTATION to get the PosTag
            Value<PosTag> tag = token.getAnnotation(POS_ANNOTATION);
            if(tag != null){
                log.info("{} has PosTag {}", token, tag.value());
            } else {
                log.info("{} has no PosTag", token);
            }
            //(2) use the key to retrieve values
            String key = "urn:test-dummy";
            Value<?> value = token.getValue(key);
            //the programmer needs to know the type!
            if(value != null && value.probability() > 0.5){
                log.info("{}={}", key, value.value());
            }
        }

        The Annotated interface supports multi-valued annotations. For that it defines methods for adding/setting and getting multiple values. Values are sorted first by probability (unknown probability last) and second by insert order (first in, first out). So calling the single value getAnnotation() method on a multi-valued field will return the first item (highest probability, and first added in case of multiple items with the same or no probability).
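
The described value ordering can be reproduced with a stable sort and a probability comparator. The Val record and sortValues() helper below are illustrative stand-ins, not the actual Stanbol implementation: probability descends, -1 (unknown) naturally sorts last, and because List#sort is stable, values with equal probability keep their insertion order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ValueOrderDemo {
    // Hypothetical stand-in for Value: payload plus probability (-1 = unknown)
    record Val(String value, double probability) {}

    // Probability descending; -1 (unknown) sorts last; List#sort is
    // stable, so equal probabilities preserve insertion order
    static void sortValues(List<Val> values) {
        values.sort(Comparator.comparingDouble(Val::probability).reversed());
    }

    public static void main(String[] args) {
        List<Val> values = new ArrayList<>(List.of(
            new Val("first-added", 0.5),
            new Val("unknown", -1),
            new Val("second-added", 0.5),
            new Val("best", 0.9)));
        sortValues(values);
        //the single value getter would return values.get(0), i.e. "best"
        for (Val v : values) {
            System.out.println(v.value());
        }
    }
}
```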


          People

          • Assignee:
            rwesten Rupert Westenthaler
            Reporter:
            rwesten Rupert Westenthaler
