OpenNLP / OPENNLP-1163

Sentence detector doesn't spot abbreviations next to punctuation


Details

    Description

      A Sentence Detector model trained with an abbreviations list (see attachment) fails to recognize those abbreviations in a text when they are immediately preceded by a punctuation mark, such as an apostrophe.

      In Italian, words starting with a vowel may be preceded by an article plus apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term ARTICOLO, especially in legal text, is frequently abbreviated to ART.
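
To make the token shape concrete, here is a minimal Java sketch. It is an illustration only and not OpenNLP code: the class and method names are hypothetical, and whether the detector performs a lookup of this kind internally is an assumption. It shows that a whitespace-delimited token such as dell'art. is not equal to the dictionary entry art., whereas the part after the last apostrophe is.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; this is NOT how OpenNLP is known to work internally.
final class ElidedAbbreviationLookup {

    // Exact, case-insensitive lookup of the whole whitespace-delimited token.
    // Assumes the abbreviation set holds lower-case entries.
    static boolean matchesWholeToken(Set<String> abbreviations, String token) {
        return abbreviations.contains(token.toLowerCase(Locale.ROOT));
    }

    // Same lookup, but also tries the part after the last apostrophe
    // (straight ' or typographic U+2019), e.g. "art." in "dell'art.".
    static boolean matchesAfterApostrophe(Set<String> abbreviations, String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        int cut = Math.max(lower.lastIndexOf('\''), lower.lastIndexOf('\u2019'));
        String tail = cut >= 0 ? lower.substring(cut + 1) : lower;
        return abbreviations.contains(lower) || abbreviations.contains(tail);
    }

    public static void main(String[] args) {
        Set<String> abbreviations = Set.of("art.");
        for (String token : new String[] {"art.", "dell'art.", "nell'ART."}) {
            System.out.println(token
                + " whole=" + matchesWholeToken(abbreviations, token)
                + " afterApostrophe=" + matchesAfterApostrophe(abbreviations, token));
        }
    }
}
```

The second method is only one possible way to treat elided articles. It is shown to contrast the two lookups, not as a proposed patch.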

      Repro steps:
      1) Add the "art." abbreviation to the abbreviations XML file (attached as it-abbr.txt; search for "art.", case-insensitive).
      2) Train a model for the Italian language (training set attached) with the following command:
      opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
      3) Run the model against a test text with the following command:
      opennlp SentenceDetector it-sen.bin < test.txt

      Even though the abbreviation "art." is included in the XML file, the sentence detector splits the sentence at instances of this abbreviation that are preceded by an article and an apostrophe (e.g. nell'art., dall'art., dell'art.). See the attached output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.
      The issue is not observed if the apostrophe (single quote) is replaced by a space character.
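
The observation above suggests a preprocessing workaround, sketched here in plain Java. This is a sketch under stated assumptions, not a fix for the detector itself: the class and method names are hypothetical, the abbreviation list is passed in by the caller, and the code simply replaces an apostrophe with a space when it sits between a letter and a known abbreviation.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; not part of OpenNLP.
final class ApostropheWorkaround {

    // Replaces an apostrophe (straight ' or typographic U+2019) with a space
    // when it is preceded by a letter and followed by one of the given
    // abbreviations (matched case-insensitively). Other apostrophes are kept.
    static String separateElidedAbbreviations(String text, List<String> abbreviations) {
        if (abbreviations.isEmpty()) {
            return text; // nothing to match against
        }
        // Quote each abbreviation so that "." is matched literally.
        String alternatives = abbreviations.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        Pattern elided = Pattern.compile(
            "(?<=\\p{L})['\u2019](?=(?:" + alternatives + "))",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return elided.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String input = "Come previsto dall'art. 5 e nell'ART. 6.";
        System.out.println(separateElidedAbbreviations(input, List.of("art.")));
    }
}
```

Because one character is replaced by exactly one character, the rewritten text has the same length as the original, so character offsets computed on the preprocessed text can be applied to the original string. Words such as l'articolo are left untouched, since "art" is not followed by a period there.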

      Attachments

        1. test.txt
          3 kB
          Gabriele Vaccari
        2. it-abbr.txt
          22 kB
          Gabriele Vaccari
        3. training-set.txt
          1.75 MB
          Gabriele Vaccari
        4. out.txt
          3 kB
          Gabriele Vaccari

          People

            mawiesne Martin Wiesner
            StarWalker777 Gabriele Vaccari
            Votes: 4
            Watchers: 7

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 1.75h