Description
TokenizerME should detect and handle multi-dot abbreviations correctly. Currently it does not. For instance,
German: "z.B." = "zum Beispiel" (for example), or
Dutch: "e.v." = "en volgende" (and following)
are split incorrectly, and extra tokens are returned. NOTE: there is no whitespace between the dots in the above examples.
Aims:
- Fix the detection and handling of multi-dot abbreviations
- Provide test cases that cover these abbreviations
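
To make the expected behavior concrete, here is a minimal, self-contained Java sketch. It does not use TokenizerME or any OpenNLP API; the class `MultiDotAbbreviationSketch`, its methods, and the two sample sentences are hypothetical and exist only for illustration. It assumes a simple whitespace split and a small set of known abbreviations, and it only shows which tokens a fixed tokenizer would be expected to return for the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only: NOT the TokenizerME implementation.
public class MultiDotAbbreviationSketch {

    // Two or more letter groups, each followed by a dot, with no whitespace
    // in between, e.g. "z.B." or "e.v.".
    private static final Pattern MULTI_DOT = Pattern.compile("(?:\\p{L}+\\.){2,}");

    public static boolean isMultiDotAbbreviation(String token) {
        return MULTI_DOT.matcher(token).matches();
    }

    // Splits on whitespace, then detaches a trailing '.' from each chunk
    // unless the chunk is a known abbreviation. Other punctuation is not
    // handled (simplifying assumption).
    public static List<String> tokenize(String text, Set<String> abbreviations) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            boolean keepWhole = abbreviations.contains(chunk)
                    || !chunk.endsWith(".")
                    || chunk.length() == 1;
            if (keepWhole) {
                tokens.add(chunk);
            } else {
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(".");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = Set.of("z.B.", "e.v.");
        // Expected: the abbreviation stays one token, the sentence-final dot is split off.
        System.out.println(tokenize("Das gilt z.B. hier.", abbreviations));
        System.out.println(tokenize("Zie pagina 12 e.v. voor details.", abbreviations));
    }
}
```

In this sketch, "z.B." and "e.v." each come back as a single token, while the sentence-final dot is returned separately. Test cases for the actual fix could assert the same token sequences against TokenizerME's output.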