OpenNLP / OPENNLP-1555

TokenizerME should detect multi-dot abbreviations


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
    • Fix Version/s: 2.3.4
    • Component/s: Tokenizer
    • Labels: None

    Description

      TokenizerME should detect and correctly handle multi-dot abbreviations, which it currently does not. For instance,

      German: "z.B." = "zum Beispiel" (for example), or
      Dutch: "e.v." = "en volgende" (and following)

      are tokenized incorrectly, and extra tokens are returned. NOTE: there is no whitespace between the dots in the examples above.

      Aims:

      • Fix the detection and handling of multi-dot abbreviations
      • Provide test cases that cover them
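
      To illustrate the intended behaviour, here is a minimal, self-contained sketch in plain Java. It does not use OpenNLP and is not the fix applied to TokenizerME. The class name, the `tokenize` helper, and the small abbreviation set are hypothetical, and it assumes the desired result is that each multi-dot abbreviation stays a single token.

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.Set;

      // Illustration only: a naive whitespace tokenizer with a lookup for
      // multi-dot abbreviations. Not OpenNLP code.
      public class MultiDotAbbreviationSketch {

          // Hypothetical abbreviation dictionary (assumption, not OpenNLP data).
          static final Set<String> ABBREVIATIONS = Set.of("z.B.", "e.v.");

          static List<String> tokenize(String text) {
              List<String> tokens = new ArrayList<>();
              for (String chunk : text.trim().split("\\s+")) {
                  if (chunk.isEmpty()) {
                      continue; // happens only for blank input
                  }
                  if (ABBREVIATIONS.contains(chunk)) {
                      // Known abbreviation: keep it whole, dots included.
                      tokens.add(chunk);
                  } else if (chunk.length() > 1 && chunk.endsWith(".")) {
                      // Otherwise treat a trailing dot as sentence punctuation.
                      tokens.add(chunk.substring(0, chunk.length() - 1));
                      tokens.add(".");
                  } else {
                      tokens.add(chunk);
                  }
              }
              return tokens;
          }

          public static void main(String[] args) {
              // prints [Das, ist, z.B., ein, Test, .]
              System.out.println(tokenize("Das ist z.B. ein Test."));
          }
      }
      ```

      The sketch only recognises an abbreviation when it is a whole whitespace-delimited chunk, so input such as "z.B.," would need extra handling. Test cases of the kind listed in the aims could assert on the token lists for the German and Dutch examples in a similar way.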


          People

            mawiesne Martin Wiesner