OPENNLP-203

UIMA Sentence Detector Trainer builds models which do not split sentences correctly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: tools-1.5.1-incubating
    • Fix Version/s: tools-1.5.2-incubating
    • Labels: None

      Description

      The models trained with the UIMA component give wrong begin/end offsets, despite the fact that they manage to split the text into sentences.
      I observed that the begin offset of a sentence includes, as its first token, the end-of-sentence punctuation character of the previous
      sentence, while the previous sentence does not include that character as its last token.

        Activity

        Joern Kottmann added a comment -

        Can you confirm that this issue only occurs when you use a model trained by the UIMA Sentence Detector Trainer?

        So you do the following:
        1. Train a model with the UIMA Sentence Detector Trainer
        2. Load the model from step 1 and run it over text

        Then you observe the wrong offsets, right?

        But when you use a pre-built sentence model the offsets are correct, right?
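
        For reference, a minimal sketch of step 2 in plain OpenNLP API code, printing the detected spans so a wrong offset is directly visible (the model file name and the sample text are assumptions):

        import java.io.FileInputStream;
        import java.io.InputStream;

        import opennlp.tools.sentdetect.SentenceDetectorME;
        import opennlp.tools.sentdetect.SentenceModel;
        import opennlp.tools.util.Span;

        public class OffsetCheck {
            public static void main(String[] args) throws Exception {
                // Load the model trained in step 1 (the file name is an assumption).
                InputStream in = new FileInputStream("uima-trained-sent.bin");
                SentenceModel model = new SentenceModel(in);
                in.close();

                SentenceDetectorME detector = new SentenceDetectorME(model);

                String text = "First sentence. Second sentence.";
                // With the bug, the second span begins at the '.' of the first
                // sentence instead of at the 'S' of "Second".
                for (Span span : detector.sentPosDetect(text)) {
                    System.out.println(span + " -> '"
                        + text.substring(span.getStart(), span.getEnd()) + "'");
                }
            }
        }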

        Nicolas Hernandez added a comment -

        Yes, I confirm.

        Following the steps you described, I observed the wrong offsets.
        The offsets were correct when I used a pre-built sentence model (like the English one coming from [1]) or when I built a model using the command line.

        For all of these building configurations I used the UIMA OpenNLP SentenceDetector.

        [1] http://opennlp.sourceforge.net/models-1.5/
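
        For comparison, the command-line training mentioned above looks roughly like this with the 1.5.x tools (the file names and language here are placeholders):

        bin/opennlp SentenceDetectorTrainer -encoding UTF-8 -lang fr -data fr-sent.train -model fr-sent.bin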

        Joern Kottmann added a comment -

        This issue is linked to the usage of the "useTokenEnd" option: if it is false, the code which computes the span makes the off-by-one error described above.

        For now I suggest that the UIMA Sentence Detector Trainer uses the same default as the command line version. Besides that, we should fix the issue in the Sentence Detector ME code.
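
        To illustrate, a simplified sketch of the off-by-one (this is not the actual OpenNLP span computation, just the arithmetic of the error):

        public class UseTokenEndSketch {
            public static void main(String[] args) {
                String text = "One. Two.";
                int[] ends = {3, 8}; // indices of the detected end-of-sentence chars

                boolean useTokenEnd = false; // the case that triggers the bug
                int begin = 0;
                for (int end : ends) {
                    // With useTokenEnd == true the span closes after the EOS char
                    // (end + 1); with useTokenEnd == false it closes at the char
                    // itself, so the next sentence's begin inherits the previous '.'.
                    int spanEnd = useTokenEnd ? end + 1 : end;
                    System.out.println("[" + begin + ", " + spanEnd + ") -> '"
                        + text.substring(begin, spanEnd) + "'");
                    begin = spanEnd;
                }
                // With useTokenEnd == false this prints:
                //   [0, 3) -> 'One'
                //   [3, 8) -> '. Two'
            }
        }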

        Joern Kottmann added a comment -

        useTokenEnd is now set to true; can you please test that this fixes your issue?

        Nicolas Hernandez added a comment - edited

        I beg your pardon, but how can I test that? There is no such parameter in the descriptor. What am I missing?

        Joern Kottmann added a comment -

        Check out and build the current head trunk. The parameter is not configurable. I will hopefully soon have time to work on a little sentence detector refactoring.

        Nicolas Hernandez added a comment -

        All right.
        I confirm it works now.

        I tested using a sample of the europarl-v6 corpus [1]:

        cat europarl-v6.fr-en.fr | perl -ne "if (/[\.\?\!\:\;\'\"»…]$/g) { print; }" | head -n 1000 > europarl-v6.fr-en.fr.1KSent

        I used the Apache whitespace tokenizer, then the OpenNLP UIMA SentenceDetectorTrainer, to build a model.
        Then I tested the model with the OpenNLP UIMA SentenceDetector.

        [1] http://www.statmt.org/europarl/

        Joern Kottmann added a comment -

        You might encounter one more issue. The sentence detector labels each potential end-of-sentence character as either a sentence-end or a no-sentence-end, and such samples are generated for training from your input file. In the input file each sentence is written on its own line, and the sample generation code assumes that the last end-of-sentence character in the line is the true sentence-end.

        In your europarl file there are lines which do not end with an end-of-sentence character but may contain tokens with end-of-sentence characters.
        For example:

        Dr. Smith said: <- In this sample the dot in "Dr." would be mistaken for a sentence end.
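
        A simplified sketch of that labeling rule (not the actual OpenNLP sample-generation code; the end-of-sentence character set is an assumption):

        public class EosLabelingSketch {
            public static void main(String[] args) {
                String line = "Dr. Smith said:"; // a line without a final EOS character
                String eosChars = ".!?";

                // The sample generation described above treats the last
                // end-of-sentence character in the line as the true sentence end.
                int lastEos = -1;
                for (int i = 0; i < line.length(); i++) {
                    if (eosChars.indexOf(line.charAt(i)) >= 0) {
                        lastEos = i;
                    }
                }

                for (int i = 0; i < line.length(); i++) {
                    if (eosChars.indexOf(line.charAt(i)) >= 0) {
                        // Here the '.' in "Dr." is the last EOS char in the line,
                        // so it would wrongly be labeled as a sentence end.
                        String label = (i == lastEos) ? "sentence-end" : "no-sentence-end";
                        System.out.println(i + " '" + line.charAt(i) + "' -> " + label);
                    }
                }
            }
        }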

        Nicolas Hernandez added a comment -

        As shown in my previous comment, I filtered out the lines which do not end with an assumed end-of-sentence character.

        Anyway, I tested a model trained on 1 million europarl sentences.

        In my test text, I have one 'M.' ('Mr.' in French) in the middle of a sentence and two occurrences as the first token of a sentence.
        The two occurrences beginning a sentence are wrongly split, but not the other one.
        I do not infer anything, but I note that it partially works.
        'M.' occurs 58 times in the training corpus.

        Joern Kottmann added a comment -

        I usually debug such issues by closely inspecting the training data. Is there a case in the training data where it splits after "M."? Are there samples in the training data where "M." occurs at the beginning of a sentence?

        It could also be caused by encoding issues. Of course, the models could also just make a classification mistake.
        I suggest using the integrated evaluation and more samples to get meaningful results; for English we end up somewhere in the 99% accuracy range.
        If you do not want to prepare a test file you could use cross validation.
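
        For reference, the integrated evaluation and cross validation are available from the 1.5.x command line tools; roughly (the file names and language are placeholders):

        bin/opennlp SentenceDetectorEvaluator -encoding UTF-8 -model fr-sent.bin -data fr-sent.test
        bin/opennlp SentenceDetectorCrossValidator -encoding UTF-8 -lang fr -data fr-sent.train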

        We should also consider adding direct support to OpenNLP to train it on europarl files.

        Joern Kottmann added a comment -

        I created a follow-up refactoring issue, OPENNLP-205.

        Joern Kottmann added a comment -

        It was confirmed that the workaround fixes this issue; a proper fix needs to be done as part of a refactoring of the Sentence Detector. See OPENNLP-205 for details.


          People

          • Assignee:
            Joern Kottmann
          • Reporter:
            Nicolas Hernandez
          • Votes:
            0
          • Watchers:
            0
