Lucene - Core
  1. Lucene - Core
  2. LUCENE-6139

TokenGroup.getStart|EndOffset should return matchStart|EndOffset not start|endOffset

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The default highlighter has a TokenGroup class that is passed to Formatter.highlightTerm(). TokenGroup also has getStartOffset() and getEndOffset() methods that ostensibly return the start and end offsets into the original text of the current term. These getters aren't called by Lucene or Solr but they are made available and are useful to me. The problem is that they return the wrong offsets when there are tokens at the same position. I believe this was an oversight of LUCENE-627 in which these getters should have been updated but weren't. The fix is simple: return matchStartOffset and matchEndOffset from these getters, not startOffset and endOffset. I think this oversight would not have occurred if Highlighter didn't have package-access to TokenGroup's fields.

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          I propose that TokenGroup's fields become private and Highlighter access them via it's getters – the ones it already has, actually, no need for more.

          This begs the question if the distinction of a "matchStartOffset" vs. "startOffset" (and "end" variants) serves any purpose. That is, toss startOffset (& endOffset) then rename matchStartOffset (& matchEndOffset) to startOffset (& endOffset). They aren't used, and I doubt others are because I think the offset info, when needed, is accessed at the end via TextFragment (populated from TokenGroup.matchStartOffset & matchEndOffset). FYI I didn't go that route because I want all matches and I found the custom Formatter approach to be more appealing than passing a very large numFragments, from an efficiency standpoint.

          Unrelated questions about Highlighter

          Not directly related to this is a couple burning questions I have in Highlighter:

          • Why oh why does Highlighter call formatter.highlightTerm for essentially every token? If TokenGroup.getTotalScore() is 0, I argue it shouldn't. All the built-in Fragmenters (and one I just wrote) start with a zero score short-circuit.
          • Why does a 0-score fragment remains a part of the fragments priority queue; why it isn't tossed out when the fragment closes out? One might argue it's needless when numFragments is small, which is the size of the PQ but it'd be nice to ask for 'all' fragments/matches without a huge PQ even if there is just one real match.
          • Why is all text run through the encoder and appended to a "newText" StringBuilder, even when the fragment has no score? If there's no point then it's a waste to do it and then not use it as it won't be a part of a returned fragment. Again, I think 0-score fragments should be immediately dropped, and newText should only be for the current fragment.
          Show
          David Smiley added a comment - I propose that TokenGroup's fields become private and Highlighter access them via it's getters – the ones it already has, actually, no need for more. This begs the question if the distinction of a "matchStartOffset" vs. "startOffset" (and "end" variants) serves any purpose. That is, toss startOffset (& endOffset) then rename matchStartOffset (& matchEndOffset) to startOffset (& endOffset). They aren't used, and I doubt others are because I think the offset info, when needed, is accessed at the end via TextFragment (populated from TokenGroup.matchStartOffset & matchEndOffset). FYI I didn't go that route because I want all matches and I found the custom Formatter approach to be more appealing than passing a very large numFragments, from an efficiency standpoint. Unrelated questions about Highlighter Not directly related to this is a couple burning questions I have in Highlighter: Why oh why does Highlighter call formatter.highlightTerm for essentially every token? If TokenGroup.getTotalScore() is 0, I argue it shouldn't. All the built-in Fragmenters (and one I just wrote) start with a zero score short-circuit. Why does a 0-score fragment remains a part of the fragments priority queue; why it isn't tossed out when the fragment closes out? One might argue it's needless when numFragments is small, which is the size of the PQ but it'd be nice to ask for 'all' fragments/matches without a huge PQ even if there is just one real match. Why is all text run through the encoder and appended to a "newText" StringBuilder, even when the fragment has no score? If there's no point then it's a waste to do it and then not use it as it won't be a part of a returned fragment. Again, I think 0-score fragments should be immediately dropped, and newText should only be for the current fragment.
          Hide
          David Smiley added a comment -

          Any opinions Mark Harwood? You were involved in LUCENE-627 and seemingly some of the development of this highlighter, back in the day.

          p.s. it's difficult to figure out which of your three JIRA accounts one should use when referencing you. I wish JIRA had a means of consolidating or indicating which are defunct. I've got 2.

          Show
          David Smiley added a comment - Any opinions Mark Harwood ? You were involved in LUCENE-627 and seemingly some of the development of this highlighter, back in the day. p.s. it's difficult to figure out which of your three JIRA accounts one should use when referencing you. I wish JIRA had a means of consolidating or indicating which are defunct. I've got 2.
          Hide
          David Smiley added a comment -

          The attached patch makes TokenGroup's fields private to force Highlighter to use the getters, and I made getStartOffset return matchStartOffset (likewise for 'end'). And I added docs.

          I experimented with not having the match vs. not distinction on the internal state in terms of getter and field. The moment a matching token (score > 0) was added to the group, the semantics of the start & end offset were constrained to be limited to just the matching token(s). But a boolean query for "hi" and "speed" failed in testOverlapAnalyzer2 to match what the test was told it should be. IMO the new behavior was totally acceptable, but I'm weary of introducing anything that would change the results without peer review. So I'll commit this instead in a couple days if there are no further comments.

          Show
          David Smiley added a comment - The attached patch makes TokenGroup's fields private to force Highlighter to use the getters, and I made getStartOffset return matchStartOffset (likewise for 'end'). And I added docs. I experimented with not having the match vs. not distinction on the internal state in terms of getter and field. The moment a matching token (score > 0) was added to the group, the semantics of the start & end offset were constrained to be limited to just the matching token(s). But a boolean query for "hi" and "speed" failed in testOverlapAnalyzer2 to match what the test was told it should be. IMO the new behavior was totally acceptable , but I'm weary of introducing anything that would change the results without peer review. So I'll commit this instead in a couple days if there are no further comments.
          Hide
          ASF subversion and git services added a comment -

          Commit 1649263 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1649263 ]

          LUCENE-6139: TokenGroup start/end offset getters should have been returning offsets of matching tokens when there are some.

          Also made the Highlighter use the getters instead of direct field access.

          Show
          ASF subversion and git services added a comment - Commit 1649263 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1649263 ] LUCENE-6139 : TokenGroup start/end offset getters should have been returning offsets of matching tokens when there are some. Also made the Highlighter use the getters instead of direct field access.
          Hide
          ASF subversion and git services added a comment -

          Commit 1649264 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1649264 ]

          LUCENE-6139: TokenGroup start/end offset getters should have been returning offsets of matching tokens when there are some.

          Also made the Highlighter use the getters instead of direct field access.

          Show
          ASF subversion and git services added a comment - Commit 1649264 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1649264 ] LUCENE-6139 : TokenGroup start/end offset getters should have been returning offsets of matching tokens when there are some. Also made the Highlighter use the getters instead of direct field access.
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development