[OPENNLP-857] ParserTool should make use of a Tokenizer instance. It should not use java.util.StringTokenizer - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.7.0
Component/s: Parser
Labels: None
Flags: Patch

Description

It would be nice if the ParserTool would make use of a real tokenizer. In addition to being the "right" thing to do, it would obviate issues like ~~OPENNLP-240~~ when using the parser tool.

While I realize that java.util.StringTokenizer effectively does the same work as WhitespaceTokenizer, it seems odd to use the former when the latter exists.

To this end, I'm attaching a patch that adds an additional method:
public static Parse[] parseLine(String line, Parser parser, Tokenizer tokenizer, int numParses)

I've left the existing method
public static Parse[] parseLine(String line, Parser parser, int numParses)
in place for convenience and backwards compatibility. It simply calls the new method with WhitespaceTokenizer.INSTANCE.
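
The overload arrangement described above can be sketched roughly as follows. This is a standalone illustration and not the contents of the attached patch: `Tokenizer` and `WhitespaceSplitter` are simplified stand-ins for the OpenNLP types, and the hypothetical `tokensForLine` methods return the tokens rather than `Parse[]` so that the example runs without the library.

```java
// Simplified stand-in for opennlp.tools.tokenize.Tokenizer (assumption: only
// the tokenize(String) method matters for this sketch).
interface Tokenizer {
    String[] tokenize(String text);
}

// Stand-in for WhitespaceTokenizer.INSTANCE: splits on runs of whitespace.
final class WhitespaceSplitter implements Tokenizer {
    static final WhitespaceSplitter INSTANCE = new WhitespaceSplitter();

    private WhitespaceSplitter() {}

    @Override
    public String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}

final class ParseLineSketch {
    // New-style entry point: the caller supplies the tokenizer.
    static String[] tokensForLine(String line, Tokenizer tokenizer) {
        return tokenizer.tokenize(line);
    }

    // Old-style entry point, kept for backwards compatibility: it delegates
    // to the new method with the whitespace default.
    static String[] tokensForLine(String line) {
        return tokensForLine(line, WhitespaceSplitter.INSTANCE);
    }
}
```

Existing callers keep using the shorter signature unchanged, while new callers can pass any tokenizer, including a lambda such as `s -> s.split(",")` in this sketch.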

For good measure, I've added a new command-line argument -tk, which takes the name of a tokenizer model. If none is specified, it will fall back on the current behavior of using the whitespace tokenizer.
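
The fallback behavior of the `-tk` option might look something like the sketch below. It is only an illustration under stated assumptions: the argument scanning is hypothetical and does not reflect how the OpenNLP command-line tools actually parse their arguments, a tokenizer is modelled as a plain `Function<String, String[]>`, and `modelLoader` is a placeholder for whatever loads the named model (in OpenNLP this would presumably involve `TokenizerModel` and `TokenizerME`).

```java
import java.util.Optional;
import java.util.function.Function;

final class TokenizerOptionSketch {
    // Returns the value that follows "-tk", if the flag is present and is
    // followed by another argument.
    static Optional<String> tokenizerModelArg(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-tk".equals(args[i])) {
                return Optional.of(args[i + 1]);
            }
        }
        return Optional.empty();
    }

    // Chooses the tokenizer: the one built by modelLoader (a hypothetical
    // hook) when -tk names a model, otherwise a whitespace splitter.
    static Function<String, String[]> chooseTokenizer(
            String[] args,
            Function<String, Function<String, String[]>> modelLoader) {
        Function<String, String[]> whitespace = text -> {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        };
        return tokenizerModelArg(args).map(modelLoader).orElse(whitespace);
    }
}
```

Keeping the whitespace splitter as the default means that invocations without `-tk` behave as they did before.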

Attachments

ParserToolTokenize.patch, 4 kB, 09/Jul/16 20:58, Tristan Nixon

People

Assignee: Jörn Kottmann

Reporter: Tristan Nixon

Votes: 0

Watchers: 2

Dates

Created: 09/Jul/16 20:57

Updated: 15/Dec/16 15:20

Resolved: 02/Nov/16 18:26