Lucene - Core
LUCENE-2229

SimpleSpanFragmenter fails to start a new fragment

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 5.4.1
    • Component/s: modules/highlighter
    • Labels:
      None

      Description

      SimpleSpanFragmenter fails to identify a new fragment when there is more than one stop word after a span is detected. This problem can be observed when the Query contains a PhraseQuery.

      The problem is that the span extends toward the end of the TokenGroup. This happens because waitForPos = positionSpans.get(i).end + 1; while position += posIncAtt.getPositionIncrement(); can advance position past the value of waitForPos, so (waitForPos == position) never matches.

      SimpleSpanFragmenter.java
        public boolean isNewFragment() {
          position += posIncAtt.getPositionIncrement();

          // BUG: when stop words are removed, position can jump past waitForPos,
          // so the equality below never matches and waitForPos is never reset.
          if (waitForPos == position) {
            waitForPos = -1;
          } else if (waitForPos != -1) {
            return false;
          }

          WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());

          if (wSpanTerm != null) {
            List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();

            for (int i = 0; i < positionSpans.size(); i++) {
              if (positionSpans.get(i).start == position) {
                waitForPos = positionSpans.get(i).end + 1;
                break;
              }
            }
          }
         ...
      

      An example is provided in the test case below, using the following Document and the phrase query "all tokens", which in the text is followed by the stop words "of a".

      Document

      "Attribute instances are reused for all tokens of a document. Thus, a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again until it returns false, which indicates that the end of the stream was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in the Attribute instances."

      HighlighterTest.java
      
       public void testSimpleSpanFragmenter() throws Exception {
      
          ...
      
          doSearching("\"all tokens\"");
      
          maxNumFragmentsRequired = 2;
          
          scorer = new QueryScorer(query, FIELD_NAME);
          highlighter = new Highlighter(this, scorer);
      
          for (int i = 0; i < hits.totalHits; i++) {
            String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
            TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
      
            highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));
      
            String result = highlighter.getBestFragments(tokenStream, text,
                maxNumFragmentsRequired, "...");
            System.out.println("\t" + result);
      
          }
        }
      
      Result

      are reused for <B>all</B> <B>tokens</B> of a document. Thus, a TokenStream/-Filter needs to update the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the Attributes and then calls incrementToken() again until it returns false, which indicates that the end of the stream was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in the Attribute instances.

      Expected Result

      for <B>all</B> <B>tokens</B> of a document
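      The position jump behind this bug can be sketched without Lucene. The snippet below is a hypothetical standalone simulation (the class and method names are illustrative, not Lucene API): it accumulates token positions from position increments the way isNewFragment() does, assuming a stop filter has removed "of" and "a" so that "document" carries a position increment of 3.

```java
import java.util.Arrays;

// Hypothetical sketch (not Lucene code): shows how stop-word removal makes
// `position` skip the value that waitForPos is waiting for.
public class PositionJumpDemo {

    // Accumulates positions the way isNewFragment() does with
    // position += posIncAtt.getPositionIncrement().
    static int[] positionsFrom(int start, int[] increments) {
        int[] positions = new int[increments.length];
        int position = start;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // "... all tokens of a document ...": "of" and "a" are dropped by the
        // stop filter, so "document" carries a position increment of 3.
        int[] increments = {1, 1, 3};                 // all, tokens, document
        int[] positions = positionsFrom(3, increments);
        System.out.println(Arrays.toString(positions)); // [4, 5, 8]

        // waitForPos == 7 (span end + 1) is skipped: positions jump 5 -> 8.
        int waitForPos = 7;
        boolean everMatched = false;
        for (int p : positions) {
            if (p == waitForPos) everMatched = true;
        }
        System.out.println(everMatched); // false: the equality check never fires
    }
}
```

      Because the equality never fires, waitForPos stays set and every subsequent token is folded into the current fragment, which is exactly the oversized Result shown above.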

      1. LUCENE-2229.patch
        3 kB
        Lukhnos Liu
      2. LUCENE-2229.patch
        4 kB
        Lukhnos Liu
      3. LUCENE-2229.patch
        3 kB
        Elmer Garduno

        Activity

        Elmer Garduno added a comment -

        Test case to reproduce this error.
        Elmer Garduno added a comment -

        This patch fixes the problem, and adds a test case.
        David Smiley added a comment -

        This is still a problem and the patch works (so says a user "lukhnos" on IRC).
        Lukhnos Liu added a comment -

        Updated patch that works against the current trunk.
        Lukhnos Liu added a comment - edited

        Chatted with David Smiley on #lucene-dev about this bug. This bug still exists in Lucene 5.x as well as the current trunk.

        I've updated Elmer Garduno's patch and am attaching it here. I tested it with the current trunk (r1721808) and the patch will fix the issue. I've been using the original patch in my day job (on Lucene 5.3.0), and the patch should also work in the 5.x branch.

        The bug is a boundary condition that doesn't manifest itself all the time. For example, if you loop through the test case in the patch with different fragment lengths (say from 30 to 200), some of the short lengths don't show the issue (because the fragment may end at the first highlighted term when the requested length is short), but most of the lengths do. If you print out the state inside the isNewFragment() method in question, it looks like this when the bug shows up:

        term: instances, position: 0, waitForPos: -1
        term: reused, position: 2, waitForPos: -1
        term: all, position: 4, waitForPos: -1
        term: tokens, position: 5, waitForPos: -1
        term: document, position: 8, waitForPos: 7  # uh oh
        term: thus, position: 9, waitForPos: 7  # and all the following tokens are included in the fragment
        term: tokenstream, position: 11, waitForPos: 7
        term: filter, position: 12, waitForPos: 7
        term: needs, position: 13, waitForPos: 7
        term: update, position: 15, waitForPos: 7
        ...
        

        Recall that the example document reads Attribute instances are reused for all tokens of a document. ... Both "of" and "a" are stop words, so there is an off-by-2 gap between the positions of "tokens" and "document".

        From the printout, position overshoots waitForPos from the token "document" onward, so isNewFragment() never gets to reset waitForPos and start a new fragment.

        After the fix, the progression changes:

        term: instances, position: 0, waitForPos: -1
        term: reused, position: 2, waitForPos: -1
        term: all, position: 4, waitForPos: -1
        term: tokens, position: 5, waitForPos: -1
        term: document, position: 8, waitForPos: 7  # reset
        term: thus, position: 9, waitForPos: -1
        term: tokenstream, position: 11, waitForPos: -1
        term: filter, position: 12, waitForPos: -1
        term: needs, position: 13, waitForPos: -1
        term: update, position: 15, waitForPos: -1
        ...
        

        So it now resets waitForPos at the first token on or past it, which fixes the issue.

        The updated test can be run with ant test -Dtests.class=org.apache.lucene.search.highlight.HighlighterTest
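        The behavior difference can also be sketched in isolation. This is a hypothetical standalone simulation (not the actual patch; the names are illustrative): it replays the positions from the printout above against the original equality check and against a relaxed waitForPos <= position check, which is presumably what the fix amounts to.

```java
// Hypothetical sketch (not the actual patch): replays the token positions
// from the debug printout against the two versions of the waitForPos check.
public class WaitForPosDemo {

    // Returns the position at which waitForPos gets reset, or -1 if never.
    static int resetAt(int[] positions, int waitForPos, boolean equalityOnly) {
        for (int position : positions) {
            boolean reached = equalityOnly ? (waitForPos == position)
                                           : (waitForPos <= position);
            if (reached) return position;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Positions of document, thus, tokenstream, ... once waitForPos is 7.
        int[] positions = {8, 9, 11, 12, 13, 15};
        System.out.println(resetAt(positions, 7, true));  // -1: equality never fires
        System.out.println(resetAt(positions, 7, false)); //  8: reset at "document"
    }
}
```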

        David Smiley added a comment -

        Looks good to me Lukhnos. Thanks for the detailed explanation. I plan to commit this tomorrow.

        I'll also adjust the mistaken javadoc comment claiming fragmentSize is measured in bytes – it's actually characters. And I'll update the for loop to use Java 5 style.
        Lukhnos Liu added a comment -

        Updated with the svn-formatted patch (the previous one was git-formatted).
        ASF subversion and git services added a comment -

        Commit 1722241 from David Smiley in branch 'dev/trunk'
        [ https://svn.apache.org/r1722241 ]

        LUCENE-2229: Fix SimpleSpanFragmenter bug with adjacent stop-words
        ASF subversion and git services added a comment -

        Commit 1722242 from David Smiley in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1722242 ]

        LUCENE-2229: Fix SimpleSpanFragmenter bug with adjacent stop-words
        David Smiley added a comment -

        Thanks Elmer & Lukhnos!
        ASF subversion and git services added a comment -

        Commit 1724095 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1724095 ]

        LUCENE-2229: Move CHANGES entry to 5.4.1.
        ASF subversion and git services added a comment -

        Commit 1724096 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1724096 ]

        LUCENE-2229: Move CHANGES entry to 5.4.1.
        ASF subversion and git services added a comment -

        Commit 1724097 from Adrien Grand in branch 'dev/branches/lucene_solr_5_4'
        [ https://svn.apache.org/r1724097 ]

        LUCENE-2229: Fix SimpleSpanFragmenter bug with adjacent stop-words
        ASF subversion and git services added a comment -

        Commit 1724175 from Adrien Grand in branch 'dev/branches/lucene_solr_5_4'
        [ https://svn.apache.org/r1724175 ]

        LUCENE-2229: Add missing CHANGES entry.

          People

          • Assignee:
            David Smiley
          • Reporter:
            Elmer Garduno
          • Votes:
            0
          • Watchers:
            3

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 1h
              Remaining: 1h
              Logged: Not Specified
