OpenNLP / OPENNLP-477

DictionaryNameFinder evaluation always returns 0, 0, -1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: tools-1.5.2-incubating
    • Fix Version/s: tools-1.5.3
    • Component/s: Name Finder
    • Environment: Ubuntu 11.10 x64, Java 1.7 update 3

      Description

      The TokenNameFinderEvaluator expects typed spans, but the DictionaryNameFinder outputs the old untyped spans. As a result, evaluation of the DictionaryNameFinder always returns 0, 0, -1, even when it finds plenty of entities.
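
      A minimal sketch of the mismatch, assuming (as the 0, 0, -1 result suggests) that the evaluator's span comparison takes the type into account; the token offsets are made up for illustration and opennlp-tools 1.5.2 is assumed to be on the classpath:

          import opennlp.tools.util.Span;

          public class SpanTypeMismatch {
              public static void main(String[] args) {
                  // What the evaluator reads from the annotated test data: a typed span.
                  Span expected = new Span(7, 9, "drug");

                  // What DictionaryNameFinder produces in 1.5.2: an untyped span.
                  Span predicted = new Span(7, 9);

                  // Same token range, but the types differ (drug vs. none), so the
                  // spans are never counted as a match and every found name ends up
                  // as both a false positive and a false negative.
                  System.out.println(expected.equals(predicted));
              }
          }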

        Activity

        Joern Kottmann added a comment -

        It was just not implemented yet.

        Jim Piliouras added a comment -

        Ok you're right...

        Jim

        Jim Piliouras added a comment -

        Yes, that sounds really good...I was thinking more about what you can do
        now...how come you don't store the type in the DictionaryNameFinder? It
        is just a string you need to store and since there is no multi-type
        support you just get it once from the first annotation...I don't see why
        that can't happen...

        Jim

        William Colen added a comment -

        But anyway I think it is not related to this jira. We can discuss it in the users list.

        William Colen added a comment -

        Today we don't have a way of creating a DictionaryNameFinder that finds multiple entity types. We don't even store the types in the dictionary. We should discuss the alternatives:

        1. Add support for a typed dictionary.
        2. Allow setting a type on DictionaryNameFinder so that it always outputs that type. Users can then combine multiple DictionaryNameFinders to get multiple types.
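
        A rough sketch of what option 2 could look like for a caller; the two-argument DictionaryNameFinder constructor used here is the proposed, hypothetical typed variant (it does not exist in 1.5.2), and the second "gene" dictionary is purely illustrative:

            import java.util.ArrayList;
            import java.util.List;

            import opennlp.tools.dictionary.Dictionary;
            import opennlp.tools.namefind.DictionaryNameFinder;
            import opennlp.tools.namefind.TokenNameFinder;
            import opennlp.tools.util.Span;

            public class MultiTypeDictionaryLookup {
                public static void main(String[] args) {
                    Dictionary drugDict = new Dictionary();   // would be filled with drug names
                    Dictionary geneDict = new Dictionary();   // would be filled with gene names

                    // Hypothetical typed constructor: each finder always emits the
                    // single type it was configured with.
                    TokenNameFinder drugs = new DictionaryNameFinder(drugDict, "drug");
                    TokenNameFinder genes = new DictionaryNameFinder(geneDict, "gene");

                    String[] tokens = {"Folic", "acid", "is", "one", "variable", "."};

                    // Combining several single-type finders yields multi-type output.
                    List<Span> names = new ArrayList<Span>();
                    for (TokenNameFinder finder : new TokenNameFinder[] {drugs, genes}) {
                        for (Span s : finder.find(tokens)) {
                            names.add(s);
                        }
                    }
                    System.out.println(names);
                }
            }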

        Jim Piliouras added a comment -

        I see... Can I assume then that you simply wouldn't be able to evaluate
        your Dictionary on a test-set that has several types of entities?

        I don't know if I'm missing something, but it sounds very weird that it
        is able to find the entities correctly (when not evaluating), yet it
        classifies them as false during evaluation...

        Ok, I guess I have to keep a separate test-set for the
        dictionary... thank god we've got regex!

        Thanks guys

        Jim

        Jim Piliouras added a comment -

        Yes, I noticed that and I thought that could be the problem... are you
        saying I should change my whole test-set? Why are the tags changed for
        the dictionary but not for the maxent model? Ok, let me try this small
        paragraph with changed tags and I'll let you know...

        Jim

        William Colen added a comment -

        It is exactly what Jörn said. DictionaryNameFinder only outputs names with 'default' type.

        Jim Piliouras added a comment -

        Ok, there is definitely a problem here... just have a look at my output: all the correct findings are classified as both false positives AND false negatives!!! This just doesn't make sense... it is easy to see it's impossible by drawing two intersecting circles representing what was found against what should have been found: the left-only part is the false positives, the right-only part is the false negatives, and the overlap in the middle is the true positives. My output follows:
        -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

        Evaluating Dictionary on test-data:
        -----------------------------------
        Expected:

        { <START:drug> Folic acid <END> is one variable, but other factors remain.}

        Predicted:

        { <START:default> Folic acid <END> is one variable, but other factors remain.}

        False positives:

        { [Folic acid] }

        False negatives:

        { [Folic acid] }

        Expected:

        { To test this hypothesis pregnant rats were exposed to either the GABA a agonist <START:drug> muscimol <END> (1, 2 or 4 mg/kg), the GABA a antagonist <START:drug> bicuculline <END> (.5, 1, or 2 mg/kg), the GABA b agonist <START:drug> baclofen <END> (15, 30, 60 mg/kg), or the GABA b antagonist <START:drug> hydroxysaclofen <END> (1, 3, or 5 mg/kg) during neural tube formation.}

        Predicted:

        { To test this hypothesis pregnant rats were exposed to either the GABA a agonist muscimol (1, 2 or 4 mg/kg), the GABA a antagonist bicuculline (.5, 1, or 2 mg/kg), the GABA b agonist <START:default> baclofen <END> (15, 30, 60 mg/kg), or the GABA b antagonist hydroxysaclofen (1, 3, or 5 mg/kg) during neural tube formation.}

        False positives:

        { [baclofen] }

        False negatives:

        { [muscimol, bicuculline, baclofen, hydroxysaclofen] }

        Expected:

        { Normal saline was used as a control and <START:drug> valproic acid <END> (600 mg/kg) as a positive control.}

        Predicted:

        { Normal saline was used as a control and <START:default> valproic acid <END> (600 mg/kg) as a positive control.}

        False positives:

        { [valproic acid] }

        False negatives:

        { [valproic acid] }

        "Elapsed time: 14.5832 msecs"

        STATISTICS FOLLOW:

        Precision: 0.0
        Recall: 0.0
        F-Measure: -1.0
        -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

        Is this similar to what you meant you noticed, William?
        I think the issue has to be reopened... or at least, if you think it doesn't match the title anymore, I can open a new one...

        Jim

        Joern Kottmann added a comment -

        > I know I said that the new code does not return 0, 0, -1 anymore but whenever I try to evaluate the dictionary using the above small paragraph it returns 0, 0, -1 again!!!

        The tags in your sample, e.g. <START:drug>, use the type drug, whereas the dictionary name finder only outputs the default type. So they do not match. That is why you get 0, 0, -1. Try it with <START:default>.
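
        For example, the first test sentence would then be annotated as:

            <START:default> Folic acid <END> is one variable, but other factors remain.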

        William Colen added a comment -

        Hi Jim

        Use the NameEvaluationErrorListener to log errors and check whether the output contains overlapping names. I noticed that in a test here. Maybe it is related to OPENNLP-471.
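
        A minimal sketch of wiring the listener into the 1.5.x evaluation setup; the dictionary and test file paths are placeholders:

            import java.io.FileInputStream;
            import java.io.InputStreamReader;

            import opennlp.tools.cmdline.namefind.NameEvaluationErrorListener;
            import opennlp.tools.dictionary.Dictionary;
            import opennlp.tools.namefind.DictionaryNameFinder;
            import opennlp.tools.namefind.NameSample;
            import opennlp.tools.namefind.NameSampleDataStream;
            import opennlp.tools.namefind.TokenNameFinderEvaluator;
            import opennlp.tools.util.ObjectStream;
            import opennlp.tools.util.PlainTextByLineStream;

            public class EvaluateDictionaryFinder {
                public static void main(String[] args) throws Exception {
                    Dictionary dict = new Dictionary(new FileInputStream("drugs.dict"));  // placeholder path
                    DictionaryNameFinder finder = new DictionaryNameFinder(dict);

                    // Test data in the <START:...> ... <END> format, one sentence per line.
                    ObjectStream<NameSample> samples = new NameSampleDataStream(
                        new PlainTextByLineStream(
                            new InputStreamReader(new FileInputStream("test.txt"), "UTF-8")));  // placeholder path

                    // The error listener prints every false positive / false negative,
                    // which makes overlapping or mistyped names easy to spot.
                    TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(
                        finder, new NameEvaluationErrorListener());

                    evaluator.evaluate(samples);
                    System.out.println(evaluator.getFMeasure());
                }
            }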

        Jim Piliouras added a comment -

        Hey William,

        I think there is a problem... perhaps the issue needs to be reopened. Let me explain...

        Basically, I am getting such low numbers from the evaluator that it made me suspicious... A big part of my corpus (roughly 40%) was annotated automatically using entries from the dictionary. This means that at least on that part of the corpus the dictionary should have 100% recall, because all the annotations exist in the dictionary (they were made from it!!!). Also, if you think about it, precision should always be 100% for the DictionaryNameFinder, simply because it is very unlikely that a dictionary will contain wrong entries. For example, I'm using drugBank.xml... there is no way that it contains entries which are not drugs. In other words, the dictionary will never make a mistake... of course it can miss some, because no one can guarantee that any dictionary will be complete, but that should only affect recall and not precision. Let me elaborate on how I tested it:

        I created the following paragraph for testing (sentences on separate lines):
        -------------------------------------------------------------

        <START:drug> Folic acid <END> is one variable, but other factors remain.
        Studies suggest that substances active at the GABA receptor may produce NTDs.
        To test this hypothesis pregnant rats were exposed to either the GABA a agonist <START:drug> muscimol <END> (1, 2 or 4 mg/kg), the GABA a antagonist <START:drug> bicuculline <END> (.5, 1, or 2 mg/kg), the GABA b agonist <START:drug> baclofen <END> (15, 30, 60 mg/kg), or the GABA b antagonist <START:drug> hydroxysaclofen <END> (1, 3, or 5 mg/kg) during neural tube formation.
        Normal saline was used as a control and <START:drug> valproic acid <END> (600 mg/kg) as a positive control.
        -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

        Now, bear with me for a second... You probably noticed that there are 6 distinct marked entities. These are: "folic acid", "muscimol", "bicuculline", "baclofen", "hydroxysaclofen" and "valproic acid". I can assure you that 3 of the 6 entities do exist in my dictionary (folic acid, baclofen, valproic acid). The rest don't (muscimol, bicuculline, hydroxysaclofen). So let's do the math:

        precision should be 100% --> all 3 entities it returns are indeed drugs. That is always the case with dictionaries!
        recall should be 50% --> it found 3 but should have found 6 --> 3/6 = 0.5
        f-score should be ~67% --> (2PR) / (P+R) = (2*1*0.5) / (1+0.5) = 1 / 1.5 ≈ 0.667
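
        (The same arithmetic as a tiny self-contained sketch, using the counts from the lines above:)

            public class FMeasureCheck {
                public static void main(String[] args) {
                    double truePositives = 3;   // folic acid, baclofen, valproic acid
                    double falsePositives = 0;  // assuming a dictionary match is always a real drug
                    double falseNegatives = 3;  // muscimol, bicuculline, hydroxysaclofen

                    double precision = truePositives / (truePositives + falsePositives); // 1.0
                    double recall = truePositives / (truePositives + falseNegatives);    // 0.5
                    double f = 2 * precision * recall / (precision + recall);            // 0.666...

                    System.out.printf("P=%.2f R=%.2f F=%.3f%n", precision, recall, f);
                }
            }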

        I know I said that the new code does not return 0, 0, -1 anymore, but whenever I try to evaluate the dictionary using the above small paragraph it returns 0, 0, -1 again!!! I'm asking it to find the entities from the same UN-annotated paragraph, and it does find 3 of the 6 expected entities! If I try to evaluate it using my whole corpus I get numbers like 3.92132548478978E-4 everywhere, which at first seemed normal, but after a bit of thinking it no longer does... even on the big corpus precision should be 100%. Ok, recall on the big corpus could easily be very low, but not precision... and also, why does evaluating the dictionary on the small paragraph bring back 0, 0, -1 again?

        I'd like to hear your thoughts on this...Do you think these 2 behaviors can be related?

        Jim

        Jim Piliouras added a comment -

        Issue addressed

        William Colen added a comment -

        In this case you can close the issue. I think you will have better results when we fix OPENNLP-471.

        Jim Piliouras added a comment -

        Yep, it works just fine now... I do get ridiculously small numbers but I
        guess that is to be expected... I can confirm that it no longer returns
        0, 0, -1.
        Thanks, William...

        Jim

        William Colen added a comment -

        Fixed this issue, but found some other related issues that will be addressed in another Jira.
        Jim, can you please check if it works now?
        Thank you


          People

          • Assignee: William Colen
          • Reporter: Jim Piliouras
          • Votes: 0
          • Watchers: 0

              Time Tracking

              • Original Estimate: 1h
              • Remaining Estimate: 1h
              • Time Spent: Not Specified
