OpenNLP / OPENNLP-1474

Create tokenizer factories for other langs (Spanish, Italian, ...)

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.2.0
    • Component/s: Tokenizer
    • Labels: None

    Description

      From https://github.com/apache/opennlp/pull/506#issuecomment-1445849746

      We can create more factories for languages such as Spanish and Italian. For example:

      // From: https://it.wikipedia.org/wiki/Alfabeto_italiano
      private static final Pattern ITALIAN = Pattern.compile("^[0-9a-zàèéìîíòóùüA-ZÀÈÉÌÎÍÒÓÙÜ]+$");
      // From: https://en.wikiversity.org/wiki/Alphabet/Spanish_alphabet
      //       https://en.wikipedia.org/wiki/Spanish_orthography#Alphabet_in_Spanish
      //       https://www.fundeu.es/consulta/tilde-en-la-y-y-griega-o-ye-24786/
      private static final Pattern SPANISH = Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÜÝÑ]+$");
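
To make the proposal easier to discuss, here is a minimal, self-contained sketch of how such per-language patterns could be selected and applied. It is not the OpenNLP implementation: the class `LanguagePatternSketch`, the methods `alphanumericFor` and `isAlphanumeric`, the two-letter language codes, and the ASCII fallback are all assumptions made for illustration. The character classes are modelled on the ones proposed here, and the Spanish class in this sketch lists Ü in both cases.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the OpenNLP API.
final class LanguagePatternSketch {

    // Character classes modelled on the patterns proposed in this issue.
    private static final Pattern ITALIAN =
            Pattern.compile("^[0-9a-zàèéìîíòóùüA-ZÀÈÉÌÎÍÒÓÙÜ]+$");
    private static final Pattern SPANISH =
            Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÜÝÑ]+$");

    // Assumption of this sketch: unknown languages fall back to ASCII only.
    private static final Pattern DEFAULT = Pattern.compile("^[0-9a-zA-Z]+$");

    // Assumption of this sketch: languages are keyed by two-letter codes.
    private static final Map<String, Pattern> BY_LANGUAGE =
            Map.of("it", ITALIAN, "es", SPANISH);

    // Returns the pattern registered for the language, or the ASCII fallback.
    static Pattern alphanumericFor(String languageCode) {
        return BY_LANGUAGE.getOrDefault(languageCode, DEFAULT);
    }

    // True if every character of the token belongs to the language's class.
    static boolean isAlphanumeric(String languageCode, String token) {
        return alphanumericFor(languageCode).matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumeric("es", "mañana"));
        System.out.println(isAlphanumeric("it", "perché"));
        System.out.println(isAlphanumeric("it", "mañana"));
    }
}
```

Because each pattern is anchored with `^` and `$`, a token matches only if all of its characters are in the language's class. A Spanish word containing ñ is therefore accepted by the Spanish pattern but rejected by the Italian one.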

      Community feedback would be appreciated.

            People

              mawiesne Martin Wiesner
              kinow Bruno P. Kinoshita