OpenNLP / OPENNLP-1385

Fix discrepancy in tokenizer documentation

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9.4, 2.0.0
    • Fix Version/s: 2.1.1
    • Component/s: Documentation, Tokenizer
    • Labels: None

    Description

      The tokenizer documentation in the user guide shows a -cutoff option in the tool's usage:
      -cutoff num
      minimal number of times a feature must be seen, ignored if -params is used.
      However, the usage printed by the CLI does not include this option:

      Arguments description:
              -factory factoryName
                      A sub-class of TokenizerFactory where to get implementation and resources.
              -abbDict path
                      abbreviation dictionary in XML format.
              -alphaNumOpt isAlphaNumOpt
                      Optimization flag to skip alpha numeric tokens for further tokenization
              -params paramsFile
                      training parameters file.
              -lang language
                      language which is being processed.
              -model modelFile
                      output model file.
              -data sampleData
                      data to be used, usually a file name.
              -encoding charsetName
                      encoding for reading and writing text, if absent the system default is used.

      The CLI does not recognize -cutoff as an option, so the documentation is likely incorrect. The code should be reviewed first to confirm this.
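
      The comparison described above can be sketched as a small script. This is only an illustration: the usage text is copied from this issue, while DOCUMENTED_OPTIONS is an assumed, partial list (the issue confirms only that the user guide shows -cutoff), and the helper name cli_options is hypothetical.

```python
import re

# Usage text as printed by the CLI, copied from this issue.
CLI_USAGE = """\
Arguments description:
        -factory factoryName
                A sub-class of TokenizerFactory where to get implementation and resources.
        -abbDict path
                abbreviation dictionary in XML format.
        -alphaNumOpt isAlphaNumOpt
                Optimization flag to skip alpha numeric tokens for further tokenization
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -model modelFile
                output model file.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
"""

# ASSUMPTION: a partial list of options shown in the user guide, for
# illustration only. The issue itself confirms only "cutoff".
DOCUMENTED_OPTIONS = ["cutoff", "params", "lang", "model", "data", "encoding"]


def cli_options(usage: str) -> set[str]:
    """Return the option names that start a line in the usage text."""
    return set(re.findall(r"^[ \t]*-([A-Za-z]\w*)", usage, flags=re.MULTILINE))


# Options the guide documents but the CLI does not list.
missing = sorted(set(DOCUMENTED_OPTIONS) - cli_options(CLI_USAGE))
print("Documented but not in CLI usage:", missing)
```

      Such a check only compares the printed usage against the guide. It does not show whether the code still honors the option, so the code review suggested above is still needed.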

          People

            aarora Atita Arora
            jzemerick Jeff Zemerick
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: