Description
The tokenizer has an optimization that skips tokens consisting only of digits or alphabetic characters. In languages other than English, alphabetic characters include umlauts and other letters outside the a-z range, so tokens containing them are not recognized as alphabetic.
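A minimal sketch of the difference, assuming the check is a regular expression over the whole token. The class and method names are hypothetical and are not OpenNLP's API; the Unicode-aware pattern is one possible fix, not necessarily the one adopted.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class AlphaNumericCheck {

    // ASCII-only check: letters outside a-z / A-Z are rejected.
    private static final Pattern ASCII_ALNUM = Pattern.compile("[A-Za-z0-9]+");

    // Unicode-aware check: \p{L} matches any letter, \p{N} any numeric character.
    private static final Pattern UNICODE_ALNUM = Pattern.compile("[\\p{L}\\p{N}]+");

    static boolean isAsciiAlphaNumeric(String token) {
        return ASCII_ALNUM.matcher(token).matches();
    }

    static boolean isUnicodeAlphaNumeric(String token) {
        return UNICODE_ALNUM.matcher(token).matches();
    }

    public static void main(String[] args) {
        // "M\u00FCller" is "Muller" spelled with a u-umlaut.
        String token = "M\u00FCller";
        System.out.println(isAsciiAlphaNumeric(token));   // false
        System.out.println(isUnicodeAlphaNumeric(token)); // true
    }
}
```

With the ASCII-only pattern, a token such as "Müller" would not qualify for the skip; with the Unicode-aware pattern it would, while tokens containing punctuation are still rejected by both.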
Issue Links
- relates to OPENNLP-1474 Create tokenizer factories for other langs (Spanish, Italian, ...) (Closed)