Description
From https://github.com/apache/opennlp/pull/506#issuecomment-1445849746
We can create more factories for languages such as Spanish and Italian. For example:
// From: https://it.wikipedia.org/wiki/Alfabeto_italiano
private static final Pattern ITALIAN = Pattern.compile("^[0-9a-zàèéìîíòóùüA-ZÀÈÉÌÎÍÒÓÙÜ]+$");
// From: https://en.wikiversity.org/wiki/Alphabet/Spanish_alphabet & https://en.wikipedia.org/wiki/Spanish_orthography#Alphabet_in_Spanish & https://www.fundeu.es/consulta/tilde-en-la-y-y-griega-o-ye-24786/
private static final Pattern SPANISH = Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$");
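To make the proposal easier to discuss, here is a minimal, self-contained sketch of how the two patterns could be selected per language. It is not the OpenNLP factory API: the class name `LanguageAlphaNumericPatterns`, the methods `forLanguage` and `isAlphaNumeric`, the `ita`/`spa` language codes, and the ASCII-only fallback pattern are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: maps a language code to an alphanumeric pattern.
final class LanguageAlphaNumericPatterns {

    // Patterns copied verbatim from the proposal above.
    private static final Pattern ITALIAN =
        Pattern.compile("^[0-9a-zàèéìîíòóùüA-ZÀÈÉÌÎÍÒÓÙÜ]+$");
    private static final Pattern SPANISH =
        Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$");

    // Assumed fallback for languages without a dedicated pattern: ASCII letters and digits only.
    private static final Pattern DEFAULT = Pattern.compile("^[A-Za-z0-9]+$");

    // Assumed keys: ISO 639-3 style codes.
    private static final Map<String, Pattern> BY_LANGUAGE =
        Map.of("ita", ITALIAN, "spa", SPANISH);

    private LanguageAlphaNumericPatterns() {
    }

    // Returns the pattern registered for the code, or the ASCII fallback.
    static Pattern forLanguage(String languageCode) {
        return BY_LANGUAGE.getOrDefault(languageCode, DEFAULT);
    }

    // True if the whole token consists of characters allowed for that language.
    static boolean isAlphaNumeric(String languageCode, String token) {
        return forLanguage(languageCode).matcher(token).matches();
    }
}
```

With this sketch, `isAlphaNumeric("spa", "niño")` and `isAlphaNumeric("ita", "città")` return true, while `isAlphaNumeric("spa", "città")` returns false because `à` is not in the Spanish character class. One point that may be worth raising in the feedback: the Spanish class as written contains lowercase `ü` but no uppercase `Ü`, so a token such as `PINGÜINO` does not match.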
Community feedback would be appreciated.
Issue Links
- is related to
  - OPENNLP-141 Tokenizers alpha numeric optimization only recognizes a-z as alpha chars (Closed)
  - OPENNLP-1479 Write better tests for pattern verification (tokenizers) (Closed)