[STANBOL-795] OpenNLP Tokenizer Engine - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: enhancement-engines-0.10.0
Component/s: Enhancement Engines
Labels: None

Description

Implement a separate OpenNLP Tokenizer Engine.

While some engines, such as the OpenNLP POS engine or the CELI Lemmatizer engine, do support tokenizing (if tokens do not already exist in the Analyzed Text), it is important to implement an engine explicitly for this task.

This engine also supports the language configuration, as in the following example:

en;model=SIMPLE
de;model=mySpecificTokenizerModel_de.bin
!jp
!zh
*

The 'model' parameter can be used to load specific tokenizer models. "SIMPLE" forces the use of the OpenNLP SimpleTokenizer. If no model configuration is present, the default tokenizer for the language is loaded ("{lang}-token.bin", or the simple tokenizer if the language model is not present).
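To illustrate how such a configuration could be interpreted, here is a minimal sketch. It is not the engine's actual code: the class `TokenizerLanguageConfig` and its methods are hypothetical. It also assumes that a leading "!" excludes a language and that "*" enables all languages not listed explicitly, which this issue does not spell out. The fallback to the simple tokenizer when a model file is missing is not modeled here.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Stanbol) that interprets the configuration lines. */
final class TokenizerLanguageConfig {

    // language -> explicitly configured model name, or null if none was given
    private final Map<String, String> models = new LinkedHashMap<>();
    // languages excluded with a leading '!' (assumed semantics)
    private final Set<String> excluded = new HashSet<>();
    // true if a '*' line was present (assumed to mean "all other languages")
    private boolean wildcard;

    static TokenizerLanguageConfig parse(List<String> lines) {
        TokenizerLanguageConfig cfg = new TokenizerLanguageConfig();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.equals("*")) {
                cfg.wildcard = true;
                continue;
            }
            if (line.startsWith("!")) {
                cfg.excluded.add(line.substring(1).trim());
                continue;
            }
            // Format: {lang};{param}={value};... ; only 'model' is handled here
            String[] parts = line.split(";");
            String lang = parts[0].trim();
            String model = null;
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.startsWith("model=")) {
                    model = param.substring("model=".length()).trim();
                }
            }
            cfg.models.put(lang, model);
        }
        return cfg;
    }

    /** Whether text in the given language would be tokenized under this configuration. */
    boolean isProcessed(String lang) {
        if (excluded.contains(lang)) {
            return false;
        }
        return wildcard || models.containsKey(lang);
    }

    /**
     * Returns "SIMPLE", an explicitly configured model file name,
     * or the default name "{lang}-token.bin" when no model is configured.
     */
    String modelFor(String lang) {
        String model = models.get(lang);
        return model != null ? model : lang + "-token.bin";
    }
}
```

With the example configuration, this sketch resolves "en" to the SIMPLE tokenizer, "de" to mySpecificTokenizerModel_de.bin, and any other non-excluded language such as "fr" to the default name fr-token.bin, while "jp" and "zh" are skipped.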

People

Assignee: Rupert Westenthaler

Reporter: Rupert Westenthaler

Votes: 0

Watchers: 1

Dates

Created: 06/Nov/12 14:44

Updated: 12/Apr/13 08:38

Resolved: 06/Nov/12 15:08