[OPENNLP-1385] Fix discrepancy in tokenizer documentation - ASF JIRA

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.9.4, 2.0.0
Fix Version/s: 2.1.1
Component/s: Documentation, Tokenizer
Labels: None

Description

The tokenizer section of the user guide lists a -cutoff option in the training tool's usage:
-cutoff num
minimal number of times a feature must be seen, ignored if -params is used.
However, this option does not appear in the usage printed by the CLI:

Arguments description:
-factory factoryName
A sub-class of TokenizerFactory where to get implementation and resources.
-abbDict path
abbreviation dictionary in XML format.
-alphaNumOpt isAlphaNumOpt
Optimization flag to skip alpha numeric tokens for further tokenization
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.

The CLI does not recognize -cutoff as an option, so the documentation is likely incorrect. The code should be reviewed first to confirm this.
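
The mismatch can also be checked mechanically by comparing option names. The following is a minimal sketch, not part of OpenNLP: it parses the usage text quoted above and reports any documented option the CLI does not list. The names CLI_USAGE, DOCUMENTED_OPTIONS, and parse_options are hypothetical, and DOCUMENTED_OPTIONS holds only the one option this issue quotes from the user guide, not the guide's full list.

```python
import re

# CLI usage text, copied from the issue description.
CLI_USAGE = """\
-factory factoryName
A sub-class of TokenizerFactory where to get implementation and resources.
-abbDict path
abbreviation dictionary in XML format.
-alphaNumOpt isAlphaNumOpt
Optimization flag to skip alpha numeric tokens for further tokenization
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
"""

# Assumption: only the option quoted in this issue; the user guide lists more.
DOCUMENTED_OPTIONS = {"-cutoff"}


def parse_options(usage):
    """Return the set of flags that start a line, e.g. '-model'."""
    return set(re.findall(r"^(-\w+)\b", usage, flags=re.MULTILINE))


cli_options = parse_options(CLI_USAGE)

# Options that the documentation mentions but the CLI usage does not.
missing_from_cli = DOCUMENTED_OPTIONS - cli_options
print(sorted(missing_from_cli))
```

A check like this only compares text against text. It does not replace the code review suggested above, since the usage output itself could be incomplete.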

People

Assignee: Atita Arora

Reporter: Jeff Zemerick

Votes: 0

Watchers: 2

Dates

Created: 17/Sep/22 18:31

Updated: 23/Nov/22 14:39

Resolved: 23/Nov/22 14:38