[OPENNLP-1202] Word tokenization - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Feedback Received
Affects Version/s: None
Fix Version/s: None
Component/s: language model
Labels:
- Annotations
Environment:
Windows Server 2016, R version 3.3.3

Flags:

Important

Description

I ran into a problem identifying words in a sentence. For contractions such as can't, tokenization with openNLP yields two tokens: "ca" and "n't".

As an example (captured in the attached screenshot), consider the tokenization of the string:

When heard the Xenogears soundtrack, so can't really describe.

Note the tokens with IDs 9 and 10 in the openNLP-output.png file.

I am not sure whether I am missing a parameter that would produce the expected result.

I would appreciate any ideas or attention from the community on this issue. Thanks.
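
The "ca" / "n't" split resembles Penn Treebank-style tokenization, where contractions are deliberately separated into a stem and a clitic, so it may be the tokenizer model's intended behavior rather than a missing parameter. If whole-word tokens are needed, one option is to re-join the clitics after tokenization. The sketch below is a library-free illustration of that idea. The helper name `merge_contractions`, the clitic list, and the sample token list are all assumptions made for this example; none of them come from openNLP or from the attached screenshot.

```python
# Hypothetical post-processing helper; not part of openNLP.
# Assumed set of clitic tokens that a Treebank-style tokenizer splits off.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}


def merge_contractions(tokens):
    """Re-attach clitic tokens (e.g. "n't") to the token before them."""
    merged = []
    for tok in tokens:
        if merged and tok.lower() in CLITICS:
            # Glue the clitic onto the previous token: "ca" + "n't" -> "can't"
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


# Assumed tokenizer output for the sentence in this report (illustrative only).
tokens = ["When", "heard", "the", "Xenogears", "soundtrack", ",",
          "so", "ca", "n't", "really", "describe", "."]
print(merge_contractions(tokens))
```

Note that this also re-joins possessives such as "'s", which may or may not be wanted depending on the downstream task.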

Attachments

- openNLPTest.r, 0.4 kB, 12/Jun/18 00:24, Dippy Aggarwal
- openNLPTest.py, 0.2 kB, 23/Aug/20 03:46, Bharani Sruthi
- OpenNLPSampleProgramOutput.png, 114 kB, 23/Aug/20 03:45, Bharani Sruthi
- openNLP-output.png, 15 kB, 12/Jun/18 00:24, Dippy Aggarwal
- contractionsdiff.txt, 5 kB, 23/Aug/20 03:46, Bharani Sruthi

People

Assignee: Unassigned

Reporter: Dippy Aggarwal

Votes: 0

Watchers: 6

Dates

Created: 12/Jun/18 00:30

Updated: 16/Dec/22 12:30

Resolved: 16/Dec/22 12:30