[STANBOL-245] Taxonomy Engine - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.9.0-incubating
Component/s: Enhancer
Labels: None

Description

The goal of this Engine is to find Terms defined in a Taxonomy within parsed content. Named Entity Recognition (NER) engines (e.g. opennlp-ner) cannot be used for this, because Taxonomies typically also contain Entities of types that NER cannot detect.

Taxonomies will be stored within a ReferencedSite of the Entityhub. Terms of the Taxonomy will be Entities of that ReferencedSite.

To process the parsed content (text), this Engine can use the following natural language processing components:

OpenNLP tokenizer (SimpleTokenizer, with the possibility to add a language-specific one)
Sentence Detector (optional): If present, the parsed content is analyzed sentence by sentence.
POS tagger (optional): Part-of-speech taggers tag each token with the type of the word. If present, this allows the Engine to look up only words of specific types (e.g. nouns). If not present, the Engine will look up every word in the parsed content.
Chunker (optional): Allows phrases to be detected within the parsed content. If not present, the Engine will try to build chunks based on the POS tags of words (e.g. two nouns in a row, or nouns connected by a preposition). If no POS tags are available either, results for the current token could be compared with those of the surrounding tokens.

NOTE: all components other than the Tokenizer are optional. The main reason for using them is to reduce the number of lookups and thereby improve performance.
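
The lookup strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Engine's implementation: the in-memory `TAXONOMY` dictionary stands in for a ReferencedSite, the regular-expression tokenizer stands in for the OpenNLP tokenizer, and the names `find_terms`, `MAX_TOKENS` and `lookup_tags` are hypothetical. POS tags, when supplied, are used only to decide which tokens start a lookup.

```python
import re

# Hypothetical in-memory taxonomy: lower-cased label -> term id.
# In the Engine described above, Terms would be Entities of a ReferencedSite.
TAXONOMY = {
    "wind turbine": "urn:example:term:wind-turbine",
    "energy": "urn:example:term:energy",
}

MAX_TOKENS = 3  # assumed upper bound on the number of tokens in a label


def tokenize(text):
    """Return (token, start, end) triples; a stand-in for a simple tokenizer."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def find_terms(text, pos_tags=None, lookup_tags=("NN", "NNS", "NNP")):
    """Find taxonomy terms in text and count the lookups that were needed.

    pos_tags, if given, holds one tag per token; only tokens whose tag is in
    lookup_tags start a lookup. Without tags, every token starts a lookup.
    """
    tokens = tokenize(text)
    matches, lookups = [], 0
    i = 0
    while i < len(tokens):
        if pos_tags is not None and pos_tags[i] not in lookup_tags:
            i += 1  # skip tokens of word types that are not looked up
            continue
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_TOKENS, len(tokens) - i), 0, -1):
            start, end = tokens[i][1], tokens[i + n - 1][2]
            lookups += 1
            term = TAXONOMY.get(text[start:end].lower())
            if term is not None:
                matches.append((text[start:end], start, end, term))
                i += n  # continue after the matched phrase
                matched = True
                break
        if not matched:
            i += 1
    return matches, lookups


text = "Wind turbine output and energy"
all_matches, all_lookups = find_terms(text)
tag_matches, tag_lookups = find_terms(text, pos_tags=["NN", "NN", "NN", "CC", "NN"])
```

In this small example both calls find the same two terms, but the call with POS tags needs fewer lookups because the conjunction never starts one. The tags are supplied by hand here; in practice they would come from a POS tagger.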

The Engine will produce TextAnnotations as well as EntityAnnotations. TextAnnotations will only be created if a Term of the Taxonomy was found. EntityAnnotations are used to represent suggested Terms within the Taxonomy.
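
The relationship between the two annotation types can be illustrated with a small sketch. The classes and field names below are hypothetical simplifications chosen for this example, not the Stanbol enhancement structure itself; the only behaviour taken from the description is that a TextAnnotation is created only when at least one Term was found, and that each EntityAnnotation represents one suggested Term.

```python
from dataclasses import dataclass


@dataclass
class TextAnnotation:
    """An occurrence in the text (illustrative fields only)."""
    selected_text: str
    start: int
    end: int


@dataclass
class EntityAnnotation:
    """One suggested Term for an occurrence (illustrative fields only)."""
    term_id: str
    text_annotation: TextAnnotation


def build_annotations(candidates):
    """candidates: iterable of (selected_text, start, end, [term_ids])."""
    text_annotations, entity_annotations = [], []
    for selected_text, start, end, term_ids in candidates:
        if not term_ids:
            continue  # no Term found, so no TextAnnotation is created
        ta = TextAnnotation(selected_text, start, end)
        text_annotations.append(ta)
        for term_id in term_ids:
            # One EntityAnnotation per suggested Term, linked to its occurrence.
            entity_annotations.append(EntityAnnotation(term_id, ta))
    return text_annotations, entity_annotations


tas, eas = build_annotations([
    ("Wind turbine", 0, 12, ["urn:example:term:wind-turbine",
                             "urn:example:term:turbine"]),
    ("output", 13, 19, []),  # looked up, but nothing found
])
```

Here the occurrence "output" yields no annotation at all, while "Wind turbine" yields one TextAnnotation with two EntityAnnotations pointing at it.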

NOTE:
Although this Engine will be able to use any ReferencedSite of the Stanbol Entityhub, it is intended to be used with Taxonomy-like data. In combination with general-purpose datasets such as dbpedia or freebase it will be of only limited use, because such datasets define entities for many commonly used words, and the Engine will create Enhancements whenever such words are present in the parsed content. It might still be possible to use this Engine successfully with such datasets, but users will need to filter the results.
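
The issue does not say how such filtering should be done. One possible approach, sketched here purely as an assumption, is to drop suggestions for common words and to keep only suggestions whose entity types are of interest. The dictionary keys (`text`, `entity`, `types`) and the function name `filter_suggestions` are hypothetical.

```python
def filter_suggestions(suggestions, allowed_types, ignored_words=frozenset()):
    """Keep suggestions of an allowed type whose text is not an ignored word.

    suggestions: list of dicts with the hypothetical keys
    "text" (matched text), "entity" (entity id) and "types" (set of type ids).
    """
    allowed = set(allowed_types)
    kept = []
    for suggestion in suggestions:
        if suggestion["text"].lower() in ignored_words:
            continue  # common word, most likely noise
        if not allowed.intersection(suggestion["types"]):
            continue  # entity is of none of the types of interest
        kept.append(suggestion)
    return kept


suggestions = [
    {"text": "Paris", "entity": "ex:Paris", "types": {"ex:Place"}},
    {"text": "The", "entity": "ex:The_(band)", "types": {"ex:Organisation"}},
    {"text": "Help", "entity": "ex:Help_(film)", "types": {"ex:Film"}},
]
kept = filter_suggestions(
    suggestions,
    allowed_types={"ex:Place", "ex:Organisation"},
    ignored_words={"the", "a", "and"},
)
```

Only the "Paris" suggestion survives in this example: "The" is on the ignore list and "Help" is of a type that was not requested.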

Issue Links

is superseded by

STANBOL-303 EntityFetch engine

Closed

People

Assignee: Rupert Westenthaler

Reporter: Rupert Westenthaler

Votes: 0

Watchers: 0

Dates

Created: 30/Jun/11 12:54

Updated: 09/May/12 13:47

Resolved: 16/Mar/12 08:14