Uploaded image for project: 'OpenNLP'
  1. OpenNLP
  2. OPENNLP-590

Tokenizer is not getting trained...

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Won't Fix
    • tools-1.5.3
    • None
    • Tokenizer
    • Ubuntu 12.04 - JVM 1.7

    Description

      Trying to train a tokenizer for Turkish from API, which doesn't learn an obvious pattern. No abbreviation dictionary is used and is either necessary for learning. The sample stream is in UTF-8.

      The code sample I used is below:

      Charset charset = Charset.forName("UTF-8");
      ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream(trainFilename),
      charset);
      ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
      TokenizerModel model;
      TokenizerFactory factory = new TokenizerFactory("tr",null,false, null);
      String tr = factory.getLanguageCode();
      model = TokenizerME.train(sampleStream, factory ,TrainingParameters.defaultParams());
      try (OutputStream modelOut = new FileOutputStream(WordOptions.OPENNLPTOKENMODELFILENAME)) {
      model.serialize(modelOut);
      modelOut.close();
      }

      sampleStream.close();

      Attachments

        Activity

          People

            Unassigned Unassigned
            volkanagun@gmail.com Hayri Volkan Agun
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 2m
                2m
                Remaining:
                Remaining Estimate - 2m
                2m
                Logged:
                Time Spent - Not Specified
                Not Specified