[OPENNLP-1555] TokenizerME should detect multi-dot abbreviations - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
Fix Version/s: 2.3.4
Component/s: Tokenizer
Labels: None

Description

TokenizerME should detect and handle multi-dot abbreviations correctly. Currently, it does not. For instance,

German: "z.B." = "zum Beispiel" (for example), or
Dutch: "e.v." = "en volgende" (and following)

are split incorrectly, so extra tokens are returned. NOTE: there is no whitespace between the dots in the examples above.

Aims:

Fix the detection and handling of multi-dot abbreviations
Provide test cases that cover these abbreviations
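
The problem and the intended behavior can be illustrated with a minimal, self-contained sketch. It does not use OpenNLP and is not how TokenizerME works internally: the class name `MultiDotAbbreviationSketch`, the `tokenize` method, and the hard-coded abbreviation set are all assumptions made for illustration. The sketch splits on whitespace and detaches a trailing period unless the whole chunk is a known abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MultiDotAbbreviationSketch {

    /**
     * Hypothetical helper: splits text on whitespace and detaches a trailing
     * period from each chunk, unless the chunk is listed as an abbreviation.
     */
    static List<String> tokenize(String text, Set<String> abbreviations) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty input yields a single empty chunk
            }
            if (abbreviations.contains(chunk)) {
                // Known multi-dot abbreviation: keep it as one token.
                tokens.add(chunk);
            } else if (chunk.length() > 1 && chunk.endsWith(".")) {
                // Otherwise treat the final period as its own token.
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(".");
            } else {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Illustrative abbreviation list taken from the examples in this issue.
        Set<String> abbreviations = Set.of("z.B.", "e.v.");

        // Without abbreviation knowledge, "z.B." yields an extra "." token.
        System.out.println(tokenize("Das gilt z.B. hier.", Set.of()));
        // prints [Das, gilt, z.B, ., hier, .]

        // With the abbreviation list, "z.B." stays intact.
        System.out.println(tokenize("Das gilt z.B. hier.", abbreviations));
        // prints [Das, gilt, z.B., hier, .]
    }
}
```

A test case for the fix could follow the same shape: tokenize a sentence containing "z.B." or "e.v." and assert that the abbreviation is returned as a single token.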

People

Assignee: Martin Wiesner

Reporter: Martin Wiesner

Votes: 0

Watchers: 2

Dates

Created: 28/Apr/24 13:34

Updated: 02/May/24 07:32

Resolved: 02/May/24 07:32