  1. OpenNLP
  2. OPENNLP-562

invoking .find() on a RegexNameFinder instance brings back Spans with identical start/end indices

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: tools-1.5.2-incubating
    • Fix Version/s: tools-1.5.3
    • Component/s: Name Finder
    • Labels:
    • Environment:
      Ubuntu 12.10 64-bit, Java 7u11

      Description

      The RegexNameFinder class has a serious bug: whenever it finds a match, it produces a Span with the same start and end index. This happens because 'sentencePosTokenMap' stores the same token position for both the start and the end of each token. Conceptually this is fine, after all it is the same token; however, later on matcher.start()/end() is invoked to determine what to look up in the map. If we've stored the same position for both, we get the same number back and the Span is ruined. The fix is to store i+1 as the endIndex for that token in the map. That is essentially the position of the next token, but since we're expecting tokenized text anyway, everything is fine. Untokenized text breaks the system anyway, so in my opinion it is safe to apply the forthcoming patch. A dirty approach would be to leave the map as is and simply replace 'matcher.end()' with 'matcher.end()+1' when we're doing the lookup.
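
      To illustrate the idea, here is a minimal standalone sketch. It is not the attached patch and does not use the OpenNLP classes: the class name RegexSpanSketch, the map name charPosToTokenIndex, and the use of int pairs in place of Span are all assumptions made for the example. It only shows how mapping a token's end offset to i+1 yields a half-open [start, end) token range.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone sketch; not the real RegexNameFinder.
public class RegexSpanSketch {

    // Returns {startToken, endTokenExclusive} pairs for each regex match
    // whose character offsets fall on recorded token boundaries.
    static List<int[]> find(String[] tokens, Pattern pattern) {
        Map<Integer, Integer> charPosToTokenIndex = new HashMap<>();
        StringBuilder sentence = new StringBuilder();

        for (int i = 0; i < tokens.length; i++) {
            // Character offset where token i starts maps to token index i.
            charPosToTokenIndex.put(sentence.length(), i);
            sentence.append(tokens[i]);
            // Character offset where token i ends maps to i + 1 (exclusive end).
            // Storing i here instead gives start == end for a one-token match.
            charPosToTokenIndex.put(sentence.length(), i + 1);
            if (i < tokens.length - 1) {
                sentence.append(' ');
            }
        }

        List<int[]> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(sentence);
        while (matcher.find()) {
            Integer start = charPosToTokenIndex.get(matcher.start());
            Integer end = charPosToTokenIndex.get(matcher.end());
            // Skip matches whose offsets are not recorded token boundaries.
            if (start != null && end != null) {
                spans.add(new int[] {start, end});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = {"Take", "200", "mg", "daily"};
        for (int[] span : find(tokens, Pattern.compile("\\d+ mg"))) {
            System.out.println("[" + span[0] + ".." + span[1] + ")");
        }
    }
}
```

      With the tokens above, the pattern matches "200 mg", which covers tokens 1 and 2, so the range is [1..3). If the end offset were mapped to i rather than i+1, a match on the single token "200" would come back as start 1, end 1, which is the behaviour reported here.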

        Attachments

        1. OPENNLP-562.patch
          3 kB
          Jim Piliouras

            People

            • Assignee:
              jkosin James Kosin
              Reporter:
              jim-85 Jim Piliouras
            • Votes:
              0
              Watchers:
              3

                Time Tracking

                Estimated:
                2h
                Remaining:
                2h
                Logged:
                Not Specified