[OPENNLP-697] Tokenizer class is hardcoded in the DocumentSampleStream class. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.6.0
Fix Version/s: 1.7.1
Component/s: Doccat, Tokenizer
Labels: None

Description

When training the DocumentCategorizerME, it is possible to set the Tokenizer that the categorizer should use,
e.g. doccatFactory.setTokenizer(SemicolonTokenizer.INSTANCE);

However, the Tokenizer is hardcoded to WhitespaceTokenizer in the DocumentSampleStream class.
As a result, the default tokenizing behaviour cannot be changed, even after a different tokenizer has been set in the doccatFactory.
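
The mismatch described above can be illustrated with a small, self-contained sketch. It does not use OpenNLP itself: `TokenizerSketch`, `WHITESPACE`, `SEMICOLON`, `parseHardcoded` and `parse` are hypothetical stand-ins, and the sketch assumes a training line of the form "category text", where the first whitespace-delimited field is the category. It only shows why a sample reader with a fixed whitespace tokenizer ignores whatever tokenizer was configured elsewhere, and how accepting the tokenizer as a parameter would avoid that.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for illustration only; none of these are OpenNLP classes.
class TokenizerSketch {

    // Splits text on runs of whitespace.
    static final Function<String, String[]> WHITESPACE = s -> s.trim().split("\\s+");

    // Assumed behaviour of a semicolon tokenizer: splits on ';' and trims around it.
    static final Function<String, String[]> SEMICOLON = s -> s.trim().split("\\s*;\\s*");

    // Models the reported behaviour: the tokenizer is fixed, so a tokenizer
    // configured elsewhere never reaches the sample text.
    static List<String> parseHardcoded(String line) {
        return parse(line, WHITESPACE);
    }

    // Models one possible alternative: the caller supplies the tokenizer.
    // Assumption: the first whitespace-delimited field of the line is the category.
    static List<String> parse(String line, Function<String, String[]> tokenizer) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2 || parts[1].isEmpty()) {
            return Collections.emptyList(); // category only, no document text
        }
        return Arrays.asList(tokenizer.apply(parts[1]));
    }

    public static void main(String[] args) {
        String line = "sports football match;world cup";
        System.out.println(parseHardcoded(line));   // [football, match;world, cup]
        System.out.println(parse(line, SEMICOLON)); // [football match, world cup]
    }
}
```

In this sketch the two calls produce different token sequences for the same line, which is the kind of training-time versus configuration mismatch the report describes.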

People

Assignee: Unassigned

Reporter: Praveena B

Votes: 0

Watchers: 2

Dates

Created: 16/May/14 06:27

Updated: 20/Jan/17 14:25

Resolved: 20/Jan/17 14:25