Details
Description
Trying to train a tokenizer for Turkish from API, which doesn't learn an obvious pattern. No abbreviation dictionary is used and is either necessary for learning. The sample stream is in UTF-8.
The code sample I used is below:
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream(trainFilename),
charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
TokenizerModel model;
TokenizerFactory factory = new TokenizerFactory("tr",null,false, null);
String tr = factory.getLanguageCode();
model = TokenizerME.train(sampleStream, factory ,TrainingParameters.defaultParams());
try (OutputStream modelOut = new FileOutputStream(WordOptions.OPENNLPTOKENMODELFILENAME)) {
model.serialize(modelOut);
modelOut.close();
}
sampleStream.close();