[OPENNLP-1099] Is this a typical tokenization issue? - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: None
Fix Version/s: 1.8.1
Component/s: Lemmatizer
Labels: None

Description

I am testing OpenNLP and found a significant tokenization issue involving punctuation.

Thank you Costco!
i love costco!
I love Costco!!
FUCK IKEA.

In all these cases, the trailing punctuation is not split off, so "Costco!" and "IKEA." are each treated as a single token. This looks like a systematic problem.
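The report does not say which tokenizer or model produced these results. As a rough illustration only, the sketch below contrasts two strategies in plain Java: splitting on whitespace alone, which yields tokens of the shape described above, and a punctuation-aware split. The class `TokenizationSketch` and both method names are hypothetical and are not part of OpenNLP; OpenNLP ships its own tokenizer implementations (for example `WhitespaceTokenizer` and `SimpleTokenizer`), whose exact rules this sketch does not try to reproduce.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; this is not OpenNLP code.
final class TokenizationSketch {

    // Whitespace-only splitting: punctuation stays attached to the preceding word.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Punctuation-aware splitting (an assumed rule for this sketch): runs of
    // letters or digits form one token, and every other non-whitespace
    // character becomes a token of its own.
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // A non-alphanumeric character ends the current word, if any.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = {"Thank you Costco!", "i love costco!", "I love Costco!!"};
        for (String sample : samples) {
            System.out.println(whitespaceTokens(sample) + " vs " + punctuationAwareTokens(sample));
        }
    }
}
```

If the output in question looks like the first list for each sentence, it may be worth checking which tokenizer is actually being invoked before treating the behavior as a defect.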

People

Assignee: Unassigned

Reporter: martin

Votes: 0

Watchers: 2

Dates

Created: 29/Jun/17 05:52

Updated: 29/Jun/17 21:53

Resolved: 29/Jun/17 21:46