OpenNLP / OPENNLP-857

ParserTool should use a Tokenizer instance instead of java.util.StringTokenizer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Parser
    • Labels:
      None
    • Flags:
      Patch

      Description

      It would be nice if the ParserTool would make use of a real tokenizer. In addition to being the "right" thing to do, it would obviate issues like OPENNLP-240 when using the parser tool.

      While I realize that java.util.StringTokenizer effectively does the same work as WhitespaceTokenizer, it seems odd to use the former when the latter exists.

      To this end, I'm attaching a patch that adds an additional method:
      public static Parse[] parseLine(String line, Parser parser, Tokenizer tokenizer, int numParses)

      I've left the existing method
      public static Parse[] parseLine(String line, Parser parser, int numParses)
      in place for convenience and backwards compatibility. It simply calls the new method with WhitespaceTokenizer.INSTANCE.
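The delegation described above can be sketched roughly as follows. This is not the attached patch: to keep it self-contained, `Tokenizer` and `WHITESPACE` are hypothetical stand-ins for `opennlp.tools.tokenize.Tokenizer` and `WhitespaceTokenizer.INSTANCE`, the `Parser` argument is omitted, and the methods return the tokens rather than `Parse[]`.

```java
import java.util.Arrays;
import java.util.List;

class ParseLineSketch {

    /** Hypothetical stand-in for opennlp.tools.tokenize.Tokenizer (only the method used here). */
    interface Tokenizer {
        String[] tokenize(String s);
    }

    /** Hypothetical stand-in for WhitespaceTokenizer.INSTANCE: splits on runs of whitespace. */
    static final Tokenizer WHITESPACE = s -> {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** New overload: the caller decides which tokenizer is used. */
    static List<String> parseLine(String line, Tokenizer tokenizer, int numParses) {
        // A real implementation would hand the tokens to the parser here;
        // this sketch only returns them so the delegation is visible.
        return Arrays.asList(tokenizer.tokenize(line));
    }

    /** Existing overload, kept for backwards compatibility: delegates with the whitespace tokenizer. */
    static List<String> parseLine(String line, int numParses) {
        return parseLine(line, WHITESPACE, numParses);
    }
}
```

Existing callers keep compiling against the old signature, while new callers can pass any tokenizer, for example one backed by a trained model.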

      For good measure, I've added a new command-line argument -tk, which takes the name of a tokenizer model. If none is specified, it will fall back on the current behavior of using the whitespace tokenizer.
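A minimal sketch of the fallback behavior for `-tk`, under stated assumptions: the argument scanning below is illustrative and does not reproduce the command-line handling in the patch, `Tokenizer` and `WHITESPACE` are hypothetical stand-ins for the OpenNLP types, and `modelLoader` is a hypothetical hook where real code would presumably build a `TokenizerME` from a `TokenizerModel`.

```java
import java.util.function.Function;

class TokenizerOptionSketch {

    /** Hypothetical stand-in for opennlp.tools.tokenize.Tokenizer. */
    interface Tokenizer {
        String[] tokenize(String s);
    }

    /** Hypothetical stand-in for WhitespaceTokenizer.INSTANCE. */
    static final Tokenizer WHITESPACE = s -> {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** Returns the value following "-tk", or null when the flag is absent or has no value. */
    static String tokenizerModelArg(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-tk".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /** Uses the model named by -tk if given, otherwise falls back to whitespace tokenization. */
    static Tokenizer chooseTokenizer(String[] args, Function<String, Tokenizer> modelLoader) {
        String modelName = tokenizerModelArg(args);
        return modelName == null ? WHITESPACE : modelLoader.apply(modelName);
    }
}
```

Because the default remains the whitespace tokenizer, invocations that do not pass `-tk` behave as they did before.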

        Attachments

        1. ParserToolTokenize.patch (4 kB, Tristan Nixon)

            People

            • Assignee: Jörn Kottmann (joern)
            • Reporter: Tristan Nixon (tnixon)
            • Votes: 0
            • Watchers: 2
