[OPENNLP-590] Tokenizer is not getting trained... - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: tools-1.5.3
Fix Version/s: None
Component/s: Tokenizer
Labels:
Environment:
Ubuntu 12.04 - JVM 1.7

Description

Trying to train a tokenizer for Turkish from API, which doesn't learn an obvious pattern. No abbreviation dictionary is used and is either necessary for learning. The sample stream is in UTF-8.

The code sample I used is below:

Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream(trainFilename),
charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
TokenizerModel model;
TokenizerFactory factory = new TokenizerFactory("tr",null,false, null);
String tr = factory.getLanguageCode();
model = TokenizerME.train(sampleStream, factory ,TrainingParameters.defaultParams());
try (OutputStream modelOut = new FileOutputStream(WordOptions.OPENNLPTOKENMODELFILENAME)) {
model.serialize(modelOut);
modelOut.close();
}

sampleStream.close();

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Hayri Volkan Agun

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 26/Jul/13 13:54

Updated:: 06/Jan/14 16:50

Resolved:: 06/Jan/14 16:50

Time Tracking

Estimated:

Remaining:

Logged:

Not Specified