Uploaded image for project: 'OpenNLP'
  1. OpenNLP
  2. OPENNLP-697

Tokenizer class is hardcoded in the DocumentSampleStream class.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Won't Fix
    • 1.6.0
    • 1.7.1
    • Doccat, Tokenizer
    • None

    Description

      While training the DocumentCategorizerME it is possible to set the type of Tokenizer that the categorizer should use.
      i,e doccatFactory.setTokenizer(SemicolonTokenizer.INSTANCE);

      But the Tokenizer class is hardcoded to WhitespaceTokenizer in the DocumentSampleStream class.
      So it is not possible to modify the default tokenizing behaviour even after setting it in the doccatFactory.

      Attachments

        Activity

          People

            Unassigned Unassigned
            praveena.b Praveena B
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: