• New Feature
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.9.0-incubating
    • Enhancer
    • None


      The goal of this Engine is to find Terms defined in a Taxonomy within parsed content. Named Entity Recognition engines (e.g. opennlp-ner) cannot be used for this, because Taxonomies typically also contain Entities of types that NER cannot detect.

      Taxonomies will be stored within a ReferencedSite of the Entityhub. Terms of the Taxonomy will be Entities of that ReferencedSite.

      To process the parsed content (text), this Engine can use the following natural language processing components:

      • OpenNLP tokenizer (SimpleTokenizer, with the possibility to add language-specific ones)
      • Sentence Detector (optional): If present, then the parsed content is analyzed sentence by sentence.
      • POS tagger (optional): Part-of-speech analyzers tag each token with the type of the word. If present, it allows this Engine to look up only words of specific types (e.g. nouns). If not present, this Engine will look up every word in the parsed content.
      • Chunker (optional): Detects phrases within the parsed content. If not present, the Engine will try to build chunks based on the POS tags of words (e.g. two nouns in a row, or nouns connected by a preposition). If no POS tags are available either, results for the current token could be compared with those of the surrounding tokens.
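
The POS-based chunk fallback described in the last bullet can be sketched roughly as follows. This is only an illustration, not the Engine's code: the class and method names are hypothetical, and the tags are assumed to follow the Penn Treebank convention ("NN*" for nouns, "IN" for prepositions).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of building chunks from POS tags when no Chunker is available. */
final class PosChunkSketch {

    /**
     * Groups tokens into chunks: a chunk starts at a noun and is extended by
     * directly following nouns, or by a preposition that is itself followed by a noun.
     * tokens[i] and tags[i] are assumed to describe the same word.
     */
    static List<String> buildChunks(String[] tokens, String[] tags) {
        List<String> chunks = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!isNoun(tags[i])) {
                i++; // chunks only start at nouns
                continue;
            }
            StringBuilder chunk = new StringBuilder(tokens[i]);
            int j = i + 1;
            while (j < tokens.length) {
                if (isNoun(tags[j])) {
                    // two nouns in a row
                    chunk.append(' ').append(tokens[j]);
                    j++;
                } else if ("IN".equals(tags[j]) && j + 1 < tokens.length && isNoun(tags[j + 1])) {
                    // nouns connected by a preposition
                    chunk.append(' ').append(tokens[j]).append(' ').append(tokens[j + 1]);
                    j += 2;
                } else {
                    break;
                }
            }
            chunks.add(chunk.toString());
            i = j;
        }
        return chunks;
    }

    // Assumption: Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with "NN".
    private static boolean isNoun(String tag) {
        return tag != null && tag.startsWith("NN");
    }
}
```

Under these assumptions, the tokens "The University of Salzburg offers computer science courses" with the tags DT NNP IN NNP VBZ NN NN NNS would yield the two chunks "University of Salzburg" and "computer science courses", each of which could then be looked up in the Taxonomy as a whole.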

      NOTE: all components other than the Tokenizer are optional. The main reason for using them is to reduce the number of lookups and thereby improve performance.
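
To make the effect on the number of lookups concrete, here is a minimal sketch of selecting lookup candidates with and without POS tags. The names are hypothetical, and restricting lookups to nouns (Penn Treebank "NN*" tags) is an assumption made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: which tokens are looked up, depending on POS tag availability. */
final class LookupCandidateSketch {

    /**
     * Returns the tokens that would be looked up in the Taxonomy.
     * If tags is null (no POS tagger available), every token is a candidate;
     * otherwise only tokens tagged as nouns are.
     */
    static List<String> lookupCandidates(String[] tokens, String[] tags) {
        List<String> candidates = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // Assumption: noun tags start with "NN" (Penn Treebank style).
            if (tags == null || (tags[i] != null && tags[i].startsWith("NN"))) {
                candidates.add(tokens[i]);
            }
        }
        return candidates;
    }
}
```

For "The engine finds terms" tagged DT NN VBZ NNS, this sketch selects two candidates ("engine", "terms") instead of four, which is the kind of reduction the note refers to.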

      The Engine will produce TextAnnotations as well as EntityAnnotations. A TextAnnotation will only be created if a Term of the Taxonomy was found. EntityAnnotations are used to represent the suggested Terms of the Taxonomy.
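
The relation between the two annotation types might be modelled as in the sketch below. It uses simplified stand-in classes rather than the actual Stanbol enhancement structure, a plain in-memory map in place of a ReferencedSite, and single-word, case-insensitive matching; all of these are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of how TextAnnotations and EntityAnnotations relate. */
final class AnnotationSketch {

    /** Stand-in for an EntityAnnotation: one suggested Term of the Taxonomy. */
    static final class EntityAnnotation {
        final String entityId;
        EntityAnnotation(String entityId) { this.entityId = entityId; }
    }

    /** Stand-in for a TextAnnotation: a text span for which at least one Term was found. */
    static final class TextAnnotation {
        final int start;
        final int end;
        final String selectedText;
        final List<EntityAnnotation> suggestions = new ArrayList<>();
        TextAnnotation(int start, int end, String selectedText) {
            this.start = start;
            this.end = end;
            this.selectedText = selectedText;
        }
    }

    /**
     * Scans the text word by word. taxonomy maps a lower-case label to the ids of
     * the Entities carrying that label. A TextAnnotation is created only for matches.
     */
    static List<TextAnnotation> annotate(String text, Map<String, List<String>> taxonomy) {
        List<TextAnnotation> result = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // skip separators
            while (pos < text.length() && !Character.isLetterOrDigit(text.charAt(pos))) {
                pos++;
            }
            int start = pos;
            while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) {
                pos++;
            }
            if (start == pos) {
                break; // only separators were left
            }
            String word = text.substring(start, pos);
            List<String> ids = taxonomy.get(word.toLowerCase(Locale.ROOT));
            if (ids != null && !ids.isEmpty()) {
                TextAnnotation ta = new TextAnnotation(start, pos, word);
                for (String id : ids) {
                    ta.suggestions.add(new EntityAnnotation(id));
                }
                result.add(ta);
            }
        }
        return result;
    }
}
```

In this sketch a word without a matching Term produces no annotation at all, while a word whose label is shared by several Entities produces one TextAnnotation with several EntityAnnotations attached.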

      Even though this Engine will be able to use any ReferencedSite of the Stanbol Entityhub, it is intended to be used with Taxonomy-like data. In combination with general purpose datasets such as dbpedia or freebase it will be of only limited use, because such datasets define entities for many commonly used words, and this Engine will create Enhancements whenever such words are present within the parsed content. It might still be possible to use this Engine successfully with such datasets, but users will need to filter the results.
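
How such filtering is done is left open here. One simple possibility, shown purely as an assumption, is to drop matches whose text is in a user-supplied list of common words; the class, method and word list below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of one way a user might filter results from a general purpose dataset. */
final class ResultFilterSketch {

    /**
     * Keeps only matched terms that are not in the given set of common words.
     * commonWords is expected to contain lower-case entries.
     */
    static List<String> dropCommonWords(List<String> matchedTerms, Set<String> commonWords) {
        List<String> kept = new ArrayList<>();
        for (String term : matchedTerms) {
            if (!commonWords.contains(term.toLowerCase(Locale.ROOT))) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

Other criteria, such as the type of the suggested Entity, could be applied in the same way.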





              rwesten Rupert Westenthaler
              rwesten Rupert Westenthaler