[OPENNLP-809] Detokenize instead of splitting string with whitespaces - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: Name Finder
Labels: None

Description

Hello,
I do not understand why RegexNameFinder separates the tokens with a whitespace. It seems pointless to me.

When we call `find(String[] token)`, you rebuild the string by appending a whitespace to the end of each token. Why?

I ask because the original string may have been tokenized by the SimpleTokenizer, and, as you know, this tokenizer splits (for example) a word from the period that follows it, so the rebuilt string gets a whitespace between the two. Example:

Original:
I am visiting Rome.

Tokenized:
I am visiting Rome*[SPLIT]*.

Regex is applied to:
I am visiting Rome .
(instead of the original)
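
The mismatch can be reproduced without OpenNLP at all. The following is a minimal sketch, not the actual RegexNameFinder code: `joinWithSpaces` is a hypothetical helper that rebuilds the sentence the way described above (each token followed by one whitespace), and the token array is written by hand to mirror the example.

```java
import java.util.regex.Pattern;

class WhitespaceJoinDemo {

    // Hypothetical helper: rebuilds a sentence by appending one space after
    // every token, as described in this report (not copied from OpenNLP).
    static String joinWithSpaces(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "I am visiting Rome.";
        // Tokens written by hand: the period is split from the word before it.
        String[] tokens = {"I", "am", "visiting", "Rome", "."};
        String rebuilt = joinWithSpaces(tokens);

        // A regex that expects the period directly after the word.
        Pattern pattern = Pattern.compile("Rome\\.");

        System.out.println(pattern.matcher(original).find()); // prints "true"
        System.out.println(pattern.matcher(rebuilt).find());  // prints "false"
    }
}
```

The same pattern matches the original sentence but not the rebuilt one, because the rebuilt text contains `Rome .` with an extra whitespace.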

In this version you have introduced a find() method that accepts a String instead of a String[], but in that case the caller passes the original string, not the rebuilt one, so the results are different.

Why not apply a detokenize method that does the EXACT inverse of the tokenization (and get the original string back instead of a modified one)?
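
One possible way to get an exact inverse is to keep the character offsets of each token, since a plain `String[]` no longer records where the whitespace was. The sketch below is only an illustration under that assumption: `tokenizeWithOffsets` and `detokenize` are hypothetical helpers, not OpenNLP methods, and the toy tokenizer only approximates splitting punctuation from words. In OpenNLP itself the offsets could presumably come from `Tokenizer.tokenizePos`, which returns `Span[]`.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetDetokenizeSketch {

    // Hypothetical tokenizer: groups letters/digits into one token and emits
    // every other non-whitespace character as its own token.
    // Returns [start, end) character offsets into the input.
    static List<int[]> tokenizeWithOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                spans.add(new int[] {start, i});
            } else {
                spans.add(new int[] {i, i + 1});
                i++;
            }
        }
        return spans;
    }

    // Hypothetical detokenizer: pads with spaces only where the original text
    // had a gap between tokens, so character offsets are preserved.
    // Limitation: any whitespace character is restored as a plain space.
    static String detokenize(String[] tokens, List<int[]> spans) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < tokens.length; k++) {
            int start = spans.get(k)[0];
            while (sb.length() < start) {
                sb.append(' ');
            }
            sb.append(tokens[k]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "I am visiting Rome.";
        List<int[]> spans = tokenizeWithOffsets(original);
        String[] tokens = new String[spans.size()];
        for (int k = 0; k < tokens.length; k++) {
            tokens[k] = original.substring(spans.get(k)[0], spans.get(k)[1]);
        }
        System.out.println(detokenize(tokens, spans).equals(original)); // prints "true"
    }
}
```

With the offsets available, a regex match on the rebuilt text lines up character for character with the original, so both find() variants would see the same input.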

Thanks.

People

Assignee: Jörn Kottmann

Reporter: Damiano

Votes: 0

Watchers: 2

Dates

Created: 29/Aug/15 14:32

Updated: 27/Apr/16 08:42