Details
- Sub-task
- Status: Closed
- Major
- Resolution: Fixed
Description
Managing NLP metadata, which is usually available at word granularity, is not feasible using the RDF metadata. This issue therefore describes the addition of a special ContentPart to Stanbol. This ContentPart will have the name AnalysedText.
AnalysedText
=====
- It wraps the text/plain ContentPart of a ContentItem.
- It allows the definition of Spans (type, start, end, spanText). Type
is an Enum: Text, TextSection, Sentence, Chunk, Span.
- Spans are sorted naturally by type, start and end. This allows the
use of a NavigableSet (e.g. TreeSet) and its #subSet() functionality
to work with contained Tokens. The #higher and #lower methods of
NavigableSet even allow building Iterators that tolerate concurrent
modifications (e.g. adding Chunks while iterating over the Tokens of a
Sentence).
- One can attach Annotations to Spans. Basically this is a multi-valued
Map with Object keys and Value<valueType> value(s) that supports a
type-safe view by using generically typed Annotation<key,valueType>
keys.
- The Value<valueType> object natively supports confidence. This
allows the same instance (e.g. of the POS tag for Noun) to be used for
all noun annotations.
- Note that the AnalysedText does NOT use RDF for representing this
kind of data, as RDF is not scalable enough. This also means that the
data of the AnalysedText are NOT available in the Enhancement Metadata
of the ContentItem. However, EnhancementEngines are free to write
all/some results to the AnalysedText AND the RDF metadata of the
ContentItem.
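The Span ordering described above can be sketched roughly as follows. This is only an illustration, not the Stanbol implementation: the class `SpanOrderingSketch`, its `Span` and `SpanType` types, and the helper methods are hypothetical names, and the sketch assumes a `TOKEN` type that sorts after `CHUNK`. Only `java.util.NavigableSet` and `java.util.TreeSet` are real APIs.

```java
import java.util.NavigableSet;
import java.util.Objects;
import java.util.TreeSet;

class SpanOrderingSketch {

    // Hypothetical span types; the declaration order defines the "type" part of the sort order.
    enum SpanType { TEXT, TEXT_SECTION, SENTENCE, CHUNK, TOKEN }

    // Hypothetical span: naturally ordered by type, then start, then end.
    static final class Span implements Comparable<Span> {
        final SpanType type;
        final int start;
        final int end;

        Span(SpanType type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override public int compareTo(Span o) {
            int c = type.compareTo(o.type);
            if (c == 0) c = Integer.compare(start, o.start);
            if (c == 0) c = Integer.compare(end, o.end);
            return c;
        }

        @Override public boolean equals(Object o) {
            return o instanceof Span && compareTo((Span) o) == 0;
        }

        @Override public int hashCode() { return Objects.hash(type, start, end); }

        @Override public String toString() { return type + "[" + start + "," + end + ")"; }
    }

    // Tokens contained in a sentence: a view obtained with subSet(), no scan over all spans.
    static NavigableSet<Span> tokensIn(NavigableSet<Span> spans, Span sentence) {
        Span lo = new Span(SpanType.TOKEN, sentence.start, sentence.start);
        Span hi = new Span(SpanType.TOKEN, sentence.end, sentence.end);
        return spans.subSet(lo, true, hi, false);
    }

    // Walks the tokens of a sentence with ceiling()/higher() instead of an Iterator,
    // so adding Chunks to the same set does not throw ConcurrentModificationException.
    static int addChunkPerToken(NavigableSet<Span> spans, Span sentence) {
        Span hi = new Span(SpanType.TOKEN, sentence.end, sentence.end);
        Span token = spans.ceiling(new Span(SpanType.TOKEN, sentence.start, sentence.start));
        int added = 0;
        while (token != null && token.compareTo(hi) < 0) {
            if (spans.add(new Span(SpanType.CHUNK, token.start, token.end))) {
                added++;
            }
            token = spans.higher(token); // next span after the current token
        }
        return added;
    }

    // Small example data: one text, two sentences, three tokens.
    static NavigableSet<Span> sample() {
        NavigableSet<Span> spans = new TreeSet<>();
        spans.add(new Span(SpanType.TEXT, 0, 11));
        spans.add(new Span(SpanType.SENTENCE, 0, 5));
        spans.add(new Span(SpanType.SENTENCE, 6, 11));
        spans.add(new Span(SpanType.TOKEN, 0, 2));
        spans.add(new Span(SpanType.TOKEN, 3, 5));
        spans.add(new Span(SpanType.TOKEN, 6, 11));
        return spans;
    }

    public static void main(String[] args) {
        NavigableSet<Span> spans = sample();
        Span first = new Span(SpanType.SENTENCE, 0, 5);
        System.out.println("tokens: " + tokensIn(spans, first));
        System.out.println("chunks added: " + addChunkPerToken(spans, first));
        System.out.println("all spans: " + spans);
    }
}
```

Because the type is the first sort criterion, all Tokens form one contiguous range of the set, so the Chunks added during the walk never appear between the Tokens being visited.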
Here is some sample code:
    AnalysedText at; //the contentPart
    Iterator<Sentence> sentences = at.getSentences();
    while(sentences.hasNext()){
        Sentence sentence = sentences.next();
        String sentText = sentence.getSpan();
        Iterator<SentenceToken> tokens = sentence.getTokens();
        while(tokens.hasNext()){
            SentenceToken token = tokens.next();
            //process the token, e.g. read or add Annotations
        }
    }
NLP annotations
=====
- TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
contains Tags of a specific generic type. The Tag only defines a
String "tag" property.
- Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
defined. Both also define an optional LexicalCategory. This is an enum
with the 12 top level concepts defined by the
[OLIA](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
Adjective, Adposition, Adverb ...).
- TagSets (including mapped LexicalCategories) are defined for all
languages where POS taggers are available for OpenNLP. This also
includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
OLIA. The other TagSets used by OpenNLP are currently not available
from OLIA.
- Note that the LexicalCategory can be used to process POS annotations
of different languages.
TagSet:
https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
POS:
https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
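To illustrate how a LexicalCategory lets code handle POS annotations independently of the language-specific tag set, here is a minimal sketch. It is not the code behind the links above: `TagSetSketch` and its simplified `Tag`, `PosTag` and `TagSet` classes are hypothetical stand-ins (the generic self-typing of Tag<tagType> is omitted), and the enum lists only the five categories named in the text. The tags used are real Penn Treebank (`NN`, `VB`) and STTS (`NN`, `VVFIN`) tags.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TagSetSketch {

    // Subset of the top level categories, for illustration only.
    enum LexicalCategory { NOUN, VERB, ADJECTIVE, ADPOSITION, ADVERB }

    // A Tag only defines the String "tag" property.
    static class Tag {
        final String tag;
        Tag(String tag) { this.tag = tag; }
    }

    // A POS tag with an optional (possibly null) LexicalCategory.
    static final class PosTag extends Tag {
        final LexicalCategory category;
        PosTag(String tag, LexicalCategory category) {
            super(tag);
            this.category = category;
        }
    }

    // A TagSet is valid for 1..n languages and holds tags of one type.
    static final class TagSet<T extends Tag> {
        final String name;
        final Set<String> languages;
        private final Map<String, T> tags = new HashMap<>();

        TagSet(String name, String... languages) {
            this.name = name;
            this.languages = new HashSet<>(Arrays.asList(languages));
        }

        TagSet<T> addTag(T tag) {
            tags.put(tag.tag, tag);
            return this;
        }

        // Returns null if the tag is not part of this TagSet.
        T getTag(String tag) { return tags.get(tag); }
    }

    // Language independent lookup: null if the tag is unknown or has no category.
    static LexicalCategory categoryOf(TagSet<PosTag> tagSet, String tag) {
        PosTag pos = tagSet.getTag(tag);
        return pos == null ? null : pos.category;
    }

    static TagSet<PosTag> penn() {
        return new TagSet<PosTag>("Penn Treebank", "en")
            .addTag(new PosTag("NN", LexicalCategory.NOUN))
            .addTag(new PosTag("VB", LexicalCategory.VERB));
    }

    static TagSet<PosTag> stts() {
        return new TagSet<PosTag>("STTS", "de")
            .addTag(new PosTag("NN", LexicalCategory.NOUN))
            .addTag(new PosTag("VVFIN", LexicalCategory.VERB));
    }

    public static void main(String[] args) {
        // The same check works for English and German tags.
        System.out.println(categoryOf(penn(), "VB") == LexicalCategory.VERB);
        System.out.println(categoryOf(stts(), "VVFIN") == LexicalCategory.VERB);
    }
}
```

A consumer that only needs "is this a verb?" can compare the LexicalCategory and never look at the language-specific tag string.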
A code sample:
    TagSet<PosTag> tagSet; //the used TagSet
    Map<String,PosTag> unknown; //missing tags in the TagSet
    Token token; //the token
    String tag; //the detected tag
    double prob; //the probability
    PosTag pos = tagSet.getTag(tag);
    if(pos == null){ //not in the TagSet: reuse a previously created instance
        pos = unknown.get(tag);
    }
    if(pos == null){
        pos = new PosTag(tag); //this tag will not have a LexicalCategory
        unknown.put(tag, pos); //only one instance
    }
    token.addAnnotation(
        NlpAnnotations.POSAnnotation,
        new Value<PosTag>(pos, prob));
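
The addAnnotation call relies on the type-safe view over a multi-valued Map with Object keys that the description mentions. One way such a view could work is sketched below. All names (`TypedAnnotationSketch`, `Annotation`, `Value`, `Annotated`) and signatures are hypothetical and chosen for illustration; the actual Stanbol classes may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TypedAnnotationSketch {

    // Typed key: K is the key type, V the type of the annotated values.
    static final class Annotation<K, V> {
        final K key;
        final Class<V> valueType;
        Annotation(K key, Class<V> valueType) {
            this.key = key;
            this.valueType = valueType;
        }
    }

    // A value together with its confidence; the wrapped object itself can be shared.
    static final class Value<V> {
        final V value;
        final double probability;
        Value(V value, double probability) {
            this.value = value;
            this.probability = probability;
        }
    }

    // Stand-in for a Span that can carry annotations.
    static final class Annotated {
        // Untyped storage: Object keys, multiple values per key.
        private final Map<Object, List<Value<?>>> annotations = new HashMap<>();

        <K, V> void addAnnotation(Annotation<K, V> annotation, Value<V> value) {
            annotations.computeIfAbsent(annotation.key, k -> new ArrayList<>()).add(value);
        }

        // Type-safe view: callers get Value<V> without casting themselves.
        @SuppressWarnings("unchecked")
        <K, V> List<Value<V>> getAnnotations(Annotation<K, V> annotation) {
            List<Value<V>> typed = new ArrayList<>();
            List<Value<?>> stored =
                annotations.getOrDefault(annotation.key, Collections.emptyList());
            for (Value<?> v : stored) {
                annotation.valueType.cast(v.value); // fails fast on a wrongly typed value
                typed.add((Value<V>) v);
            }
            return typed;
        }
    }

    public static void main(String[] args) {
        Annotation<String, String> pos = new Annotation<>("pos", String.class);
        String noun = "NN"; // one shared tag instance
        Annotated first = new Annotated();
        Annotated second = new Annotated();
        first.addAnnotation(pos, new Value<>(noun, 0.9));
        second.addAnnotation(pos, new Value<>(noun, 0.6));
        System.out.println(first.getAnnotations(pos).get(0).probability);
        System.out.println(second.getAnnotations(pos).get(0).probability);
    }
}
```

Keeping the probability in the Value rather than in the tag is what lets a single tag instance be reused across many annotations with different confidences.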