[OPENNLP-141] Tokenizer's alphanumeric optimization only recognizes a-z as alpha chars - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: tools-1.5.0-sourceforge
Fix Version/s: 2.2.0
Component/s: Tokenizer
Labels: None

Description

The Tokenizer has an optimization that skips tokens consisting only of digits or alphabetic characters. In languages other than English, alphabetic characters include umlauts and other letters outside the a-z range, which the optimization does not recognize.
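The difference can be illustrated with a minimal sketch using `java.util.regex`. The class name, the two patterns, and the helper methods below are hypothetical and chosen only for illustration; they are not taken from the OpenNLP source, and the issue does not state how the fix was actually implemented. The sketch assumes the check is a regular expression over the whole token.

```java
import java.util.regex.Pattern;

public class AlphaNumericCheck {

    // Hypothetical ASCII-only check: letters a-z, A-Z and digits 0-9.
    // Illustrates the limitation described in the issue.
    static final Pattern ASCII_ALNUM = Pattern.compile("[A-Za-z0-9]+");

    // Hypothetical Unicode-aware check: any Unicode letter (\p{L})
    // or decimal digit (\p{Nd}). One possible alternative, not
    // necessarily the one adopted by the project.
    static final Pattern UNICODE_ALNUM = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // True if the whole token matches the ASCII-only pattern.
    static boolean isAsciiAlnum(String token) {
        return ASCII_ALNUM.matcher(token).matches();
    }

    // True if the whole token matches the Unicode-aware pattern.
    static boolean isUnicodeAlnum(String token) {
        return UNICODE_ALNUM.matcher(token).matches();
    }

    public static void main(String[] args) {
        // "H\u00e4user" is the German word "Häuser" with a precomposed a-umlaut.
        String[] tokens = {"Haus", "H\u00e4user", "2011", "Haus."};
        for (String token : tokens) {
            System.out.println(token
                + " ascii=" + isAsciiAlnum(token)
                + " unicode=" + isUnicodeAlnum(token));
        }
    }
}
```

Under these assumptions, a token with an umlaut fails the ASCII-only check but passes the Unicode-aware one, while a token with trailing punctuation fails both. Note that `\p{L}` does not match combining marks, so text in decomposed form (a base letter followed by a combining diaeresis) would need normalization first.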

Issue Links

relates to: OPENNLP-1474 Create tokenizer factories for other langs (Spanish, Italian, ...) (Closed)

People

Assignee: Martin Wiesner

Reporter: Jörn Kottmann

Votes: 1

Watchers: 2

Dates

Created: 03/Mar/11 13:06

Updated: 22/Apr/23 17:39

Resolved: 02/Mar/23 06:20