  1. OpenNLP
  2. OPENNLP-809

Detokenize instead of splitting string with whitespaces


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Name Finder
    • Labels: None

    Description

      Hello,
      I do not understand why RegexNameFinder separates the tokens with a whitespace. It seems pointless to me.

      When we call `find(String[] token)`, you rebuild the string by appending a whitespace to the end of each token. Why?

      I ask because the original string may have been tokenized by the SimpleTokenizer. As you know, this tokenizer splits (for example) between a word and the period that follows it, so the rebuilt string gets a whitespace at that position. Example:

      Original:
      I am visiting Rome.

      Tokenized:
      I am visiting Rome*[SPLIT]*.

      Regex is applied to:
      I am visiting Rome .
      (instead of the original)
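
The mismatch can be reproduced without OpenNLP. The sketch below is only an illustration: `RejoinDemo`, `rejoin` and `matches` are hypothetical names, and `rejoin` mimics the behaviour described in this report (every token followed by one space), not the actual RegexNameFinder source.

```java
import java.util.regex.Pattern;

class RejoinDemo {
    // Assumption: mimics the rebuild step described in the report,
    // i.e. each token is followed by a single space.
    static String rejoin(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token).append(' ');
        }
        return sb.toString();
    }

    // True if the regex matches anywhere in the text.
    static boolean matches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String original = "I am visiting Rome.";
        String[] tokens = {"I", "am", "visiting", "Rome", "."};
        String rebuilt = rejoin(tokens); // "I am visiting Rome . "

        String regex = "Rome\\."; // a word immediately followed by a period
        System.out.println(matches(regex, original)); // true
        System.out.println(matches(regex, rebuilt));  // false
    }
}
```

The same pattern finds a match in the original text but not in the rebuilt text, which is the difference between the two `find` variants described here.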

      This version introduced a find() method that accepts a String instead of a String[], but in that case the caller passes the original string, not the rebuilt one, so the results are different.

      Why not apply a detokenize method that does the EXACT inverse of the tokenization, so that the regex runs on the original string again instead of a modified one?
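
One possible way to get such an inverse is to keep each token's start offset in the original text and use it when rebuilding. This is a minimal sketch and not an existing OpenNLP method: `OffsetDetokenizer` and `detokenize` are hypothetical names, and it assumes the gaps between tokens in the original were plain spaces. In OpenNLP the offsets could come from `Tokenizer.tokenizePos`, which returns character spans.

```java
class OffsetDetokenizer {
    // Rebuilds the text from tokens and their start offsets in the original.
    // Assumption: any gap between two tokens was made of space characters.
    static String detokenize(String[] tokens, int[] starts) {
        if (tokens.length != starts.length) {
            throw new IllegalArgumentException("tokens and starts must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            // Pad with spaces until the position where this token started.
            while (sb.length() < starts[i]) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "am", "visiting", "Rome", "."};
        int[] starts = {0, 2, 5, 14, 18}; // offsets in "I am visiting Rome."
        System.out.println(detokenize(tokens, starts)); // I am visiting Rome.
    }
}
```

Because "Rome" ends at offset 18 and "." starts at offset 18, no space is inserted between them, so a regex applied to the result behaves as it would on the original input.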

      Thanks.

          People

            Assignee: Jörn Kottmann (joern)
            Reporter: Damiano (damiano)
            Votes: 0
            Watchers: 2
