Lucene - Core
LUCENE-627

highlighter problems with overlapping tokens

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.1
    • Component/s: core/other
    • Labels:
      None

      Description

      The lucene highlighter has problems when tokens that overlap are generated.

      For example, if analysis of iPod generates the tokens "i", "pod", "ipod" (with pod and ipod in the same position),
      then the highlighter will output this as iipod, regardless of whether any of those tokens are highlighted.

      Discovered via http://issues.apache.org/jira/browse/SOLR-24
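      The duplication mechanism can be seen in isolation: the highlighter copies original.substring(start, end) for each group of overlapping tokens, and a group that rewinds to an earlier start offset re-copies characters that were already emitted. A minimal self-contained sketch (the class name and the hard-coded group offsets are illustrative assumptions inferred from the discussion in this issue, not the actual highlighter code):

```java
public class OverlapDup {
    public static void main(String[] args) {
        String text = "ipod";
        // Token "i" (0,1) forms its own group; "pod" (1,4) and "ipod" (0,4)
        // share a token position, so their group spans offsets 0..4 again.
        int[][] groups = { {0, 1}, {0, 4} };
        StringBuilder out = new StringBuilder();
        for (int[] g : groups) {
            // Re-copies the leading "i" that the first group already emitted.
            out.append(text.substring(g[0], g[1]));
        }
        System.out.println(out); // iipod
    }
}
```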

      1. Highlighter.java.diff
        2 kB
        Kerang Lv
      2. highlight_overlap.diff
        18 kB
        Yonik Seeley

        Activity

        Mark Harwood added a comment -

        Works OK for me - just added this to the Highlighter's Junit test and all is well.
        Something up with your analyzer?

        public void testOverlapAnalyzer2() throws Exception
        {
            HashMap synonyms = new HashMap();
            synonyms.put("ipod", "i,pod");
            Analyzer analyzer = new SynonymAnalyzer(synonyms);
            String srchkey = "ipod";
            String s = "I want an ipod";
            QueryParser parser = new QueryParser("text", new WhitespaceAnalyzer());
            Query query = parser.parse(srchkey);
            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(s));
            // Get 3 best fragments and separate with a "..."
            String result = highlighter.getBestFragments(tokenStream, s, 3, "...");
            String expectedResult = "I want an <B>ipod</B>";
            assertEquals(expectedResult, result);
        }
        Yonik Seeley added a comment -

        The start and end offsets also overlap... I wonder if that's what is different.

        Solr's analyzer output:
        http://localhost:8983/solr/admin/analysis.jsp?name=text&verbose=on&val=iPod

        term position 1, 2
        term text i, pod/ipod
        term type word, word/word
        source start,end (0,1), (1,4)/(0,4)

        I'll try and come up with a junit test that demonstrates it.

        Yonik Seeley added a comment -

        OK, here we go... the following test fails with
        Expected :ipod <B>foo</B>
        Actual :iiPod <B>foo</B>

        public void testOverlapAnalyzer2() throws Exception
        {

        String s = "iPod foo";
        // the token stream for the string above:
        TokenStream ts = new TokenStream() {
        Iterator iter;

        {
            List lst = new ArrayList();
            Token t;
            t = new Token("i",0,1);
            lst.add(t);
            t = new Token("pod",1,4);
            t.setPositionIncrement(0);
            lst.add(t);
            t = new Token("ipod",0,4);
            lst.add(t);
            t = new Token("foo",5,8);
            lst.add(t);
            iter = lst.iterator();
        }

        public Token next() throws IOException
        {
            return iter.hasNext() ? (Token)iter.next() : null;
        }

        };

        String srchkey = "foo";

        QueryParser parser=new QueryParser("text",new WhitespaceAnalyzer());
        Query query = parser.parse(srchkey);

        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // Get 3 best fragments and separate with a "..."
        String result = highlighter.getBestFragments(ts, s, 3, "...");
        String expectedResult="ipod <B>foo</B>";
        assertEquals(expectedResult,result);
        }

        Mark Harwood added a comment -

        The problem appears to be because the "pod" token advances the start position to 1 while the next token "ipod" takes a step back (to 0)

        I've found if you just arrange the tokens to be emitted in start pos order all is fine - see below

        public void testOverlapAnalyzer2() throws Exception
        {

        String s = "iPod foo";
        // the token stream for the string above:
        TokenStream ts = new TokenStream() {
        Iterator iter;

        {
            List lst = new ArrayList();
            Token t;
            //moved this token to start
            t = new Token("ipod",0,4);
            lst.add(t);
            t = new Token("i",0,1);
            lst.add(t);
            t = new Token("pod",1,4);
            t.setPositionIncrement(0);
            lst.add(t);
            t = new Token("foo",5,8);
            lst.add(t);
            iter = lst.iterator();
        }

        public Token next() throws IOException
        {
            return iter.hasNext() ? (Token)iter.next() : null;
        }

        };

        String srchkey = "foo";

        QueryParser parser=new QueryParser("text",new WhitespaceAnalyzer());
        Query query = parser.parse(srchkey);

        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // Get 3 best fragments and separate with a "..."
        String result = highlighter.getBestFragments(ts, s, 3, "...");
        //had to upper case the P in the test here
        String expectedResult="iPod <B>foo</B>";
        assertEquals(expectedResult,result);
        }

        Mark Harwood added a comment -

        Position increments wrong in my above code but the solution of sequencing start offsets correctly still holds true:

        t = new Token("ipod",0,4);
        t.setPositionIncrement(0);
        lst.add(t);
        t = new Token("i",0,1);
        t.setPositionIncrement(0);
        lst.add(t);
        t = new Token("pod",1,4);
        lst.add(t);
        t = new Token("foo",5,8);
        lst.add(t);

        Yonik Seeley added a comment -

        The original token stream is a valid one though right? If so, it seems like the fix should be in the highlighter module.

        Mark Harwood added a comment -

        >>The original token stream is a valid one though right?

        I don't think so, see below...

        List lst = new ArrayList();
        Token t;
        t = new Token("i",0,1);
        lst.add(t);
        t = new Token("pod",1,4);
        t.setPositionIncrement(0);
        lst.add(t);
        t = new Token("ipod",0,4);
        !! Missing a t.setPositionIncrement(0) here.
        lst.add(t);
        t = new Token("foo",5,8);
        lst.add(t);
        iter = lst.iterator();

        Having fixed the above I believe this change below is all that is required to fix the highlighter:

        TokenGroup.java

        boolean isDistinct(Token token)
        {
            // return token.startOffset()>=endOffset;
            return token.getPositionIncrement()>0;
        }

        All my Junit tests pass with this change - can you verify this is true for you too?
        This change would break highlighting for any analyzers that had position increments but whose offsets somehow overlapped - text would potentially be duplicated in the same way you originally reported your problem. I can't verify this to be true for CJK analyzers etc so feel a little uneasy about committing this.

        Cheers
        Mark

        Yonik Seeley added a comment -

        >>The original token stream is a valid one though right?
        > I don't think so, see below...

        Ah, right... I constructed the wrong one first. I wanted pod and ipod in the same position... so the token stream looks like "i" ("pod"|"ipod") "foo".
        Now this token-stream is correct, I believe, but the same problem happens.

        A work-around is to swap the order that "pod" and "ipod" tokens appear, but it seems like any such workaround should be put into the highlighter rather than external to it.

        public void testOverlapAnalyzer2() throws Exception
        {

        String s = "iPod foo";
        // the token stream for the string above:
        TokenStream ts = new TokenStream() {
        Iterator iter;

        {
            List lst = new ArrayList();
            Token t;
            t = new Token("i",0,1);
            lst.add(t);
            t = new Token("pod",1,4);
            lst.add(t);
            t = new Token("ipod",0,4);
            t.setPositionIncrement(0); // pod and ipod occupy the same token position.
            lst.add(t);
            t = new Token("foo",5,8);
            lst.add(t);
            iter = lst.iterator();
        }

        public Token next() throws IOException
        {
            return iter.hasNext() ? (Token)iter.next() : null;
        }

        };

        String srchkey = "foo";

        QueryParser parser=new QueryParser("text",new WhitespaceAnalyzer());
        Query query = parser.parse(srchkey);

        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // Get 3 best fragments and separate with a "..."
        String result = highlighter.getBestFragments(ts, s, 3, "...");
        String expectedResult="iPod <B>foo</B>";
        assertEquals(expectedResult,result);
        }

        Yonik Seeley added a comment -

        I just tried the change to isDistinct, and it didn't work for the corrected tokenstream I posted.
        I agree that it doesn't seem like the right fix anyway. It seems like things should be based on startOffset and endOffset.

        Yonik Seeley added a comment -

        Here's another fun one

        junit.framework.ComparisonFailure:
        Expected :zooPod<B>Foo</B>
        Actual :zoozooPod<B>zooPodFoo</B>

        public void testOverlapAnalyzer3() throws Exception
        {

        String s = "zooPodFoo";
        // the token stream for the string above:
        TokenStream ts = new TokenStream() {
        Iterator iter;

        {
            List lst = new ArrayList();
            Token t;
            t = new Token("zoo",0,3);
            lst.add(t);
            t = new Token("pod",3,6);
            lst.add(t);
            t = new Token("zoopod",0,6);
            t.setPositionIncrement(0);
            lst.add(t);
            t = new Token("foo",6,9);
            lst.add(t);
            t = new Token("zoopodfoo",0,9);
            t.setPositionIncrement(0);
            lst.add(t);
            iter = lst.iterator();
        }

        public Token next() throws IOException
        {
            return iter.hasNext() ? (Token)iter.next() : null;
        }

        };

        String srchkey = "foo";

        QueryParser parser=new QueryParser("text",new WhitespaceAnalyzer());
        Query query = parser.parse(srchkey);

        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // Get 3 best fragments and separate with a "..."
        String result = highlighter.getBestFragments(ts, s, 3, "...");
        String expectedResult="zooPod<B>Foo</B>";
        assertEquals(expectedResult,result);
        }

        Yonik Seeley added a comment -

        It seems like maybe the only way to handle some of this stuff is two passes... one to collect the scores & offsets of matches, and then a second pass to generate the fragments.

        Mark Harwood added a comment -

        >>It seems like maybe the only way to handle some of this stuff is two passes

        The highlighter does not expect token positions to "rewind" in this manner. I'm not sure where this ends. Imagine an analyzer which, having considered and emitted tokens for a whole document, chooses to append some tokens whose offsets reference much earlier sections of the document. (Why it would, I'm not sure, but there's nothing to say this couldn't happen.)

        >>It seems like maybe the only way to handle some of this stuff is two passes

        Maybe a special "OrderFixer" TokenStream could be used to wrap "rewinding" token streams such as yours: accumulate all tokens in a buffer, then sort and output them in ascending start-offset order. If the Highlighter ignored position increments and just used offsets (as it does currently) I suspect all would be OK
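        A buffer-and-sort wrapper along those lines might look like the sketch below. The Tok class is a hypothetical stand-in for Lucene's Token, and breaking start-offset ties by emitting the longer token first is an assumption that matches the ordering used in Mark's corrected test above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrderFixerSketch {
    // Minimal stand-in for Lucene's Token: text plus start/end offsets.
    static class Tok {
        final String text;
        final int start, end;
        Tok(String text, int start, int end) {
            this.text = text; this.start = start; this.end = end;
        }
    }

    // Buffer all tokens, then re-emit in ascending start-offset order;
    // ties go to the longer token (larger end offset) first.
    static List<Tok> reorder(List<Tok> tokens) {
        List<Tok> sorted = new ArrayList<Tok>(tokens);
        sorted.sort((a, b) -> a.start != b.start ? a.start - b.start : b.end - a.end);
        return sorted;
    }

    public static void main(String[] args) {
        // The "rewinding" stream from the failing test: "ipod" arrives after
        // "pod" even though it starts earlier in the original text.
        List<Tok> in = Arrays.asList(
                new Tok("i", 0, 1),
                new Tok("pod", 1, 4),
                new Tok("ipod", 0, 4),
                new Tok("foo", 5, 8));
        StringBuilder sb = new StringBuilder();
        for (Tok t : reorder(in)) {
            sb.append(t.text).append(' ');
        }
        System.out.println(sb.toString().trim()); // ipod i pod foo
    }
}
```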

        Yonik Seeley added a comment -

        Right... I'm not sure the highlighter should be expected to handle all cases, but WordDelimiterFilter doesn't seem that complex or atypical.

        "a-b-c" would be indexed as "a","b","c"/"abc"
        (c and abc occupy the same token position)

        Another thing I ran across is the addition of non-scoring tokens to a TokenGroup... this ends up highlighting the widest token in the token group, rather than the widest that matched. I was able to get around this by checking that score>0 in TokenGroup, but was this intended?

        Yonik Seeley added a comment -

        I'm going with the OrderFixer approach for now... if startOffsets are equal, but endOffsets are not, does it matter which comes first?

        Yonik Seeley added a comment -

        Here is a patch that makes the tests work after tokens are re-ordered.
        I basically keep track of the start and end offsets of what matches, not just of the TokenGroup, so highlighting is more specific.

        Sorry about the messy patch - my IDE collapses the tabs so I don't even see them (I only realize there are tabs after I do a diff).

        Yonik Seeley added a comment -

        So Mark, does this patch look OK?
        Without it, even if I order the tokens by startOffset, I get things like
        HiHi-Speed <em>USB</em>

        WordDelimiterFilter (that's what is producing these types of tokens) is widely used in Solr-land, so I'm eager to get this fixed.

        Mark Harwood added a comment -

        Hi Yonik,
        Been away for a little while and just managed to catch up with your changes. Thanks for this.

        Looking at the patch all seems good. I would just change a couple of things:

        1) TokenGroup.getTotalScore should just return your new "tot" variable
        2) TokenGroup.clear() needs to reinitialize "tot" to zero.

        Other than that all looks good to me. Let me know if you're happy and I'll commit it.

        Thanks again

        Yonik Seeley added a comment -

        > 1) TokenGroup.getTotalScore should just return your new "tot" variable

        OK, I had considered that change before, but because all the vars were package protected rather than private, I wasn't sure if it was safe.

        > 2) TokenGroup.clear() needs to reinitialize "tot" to zero.

        Oops... that one I missed.

        > Other than that all looks good to me. Let me know if you're happy and I'll commit it.

        Looks fine to me... please do!

        Yonik Seeley added a comment -

        Closing... works fine for me now.
        Thanks!

        Kerang Lv added a comment -

        Hi Yonik,
        I'm trying to add support for an overlapping bigram analyzer, e.g. the CJKAnalyzer (http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/), on top of your patch.

        With your patch, the following test fails with:
        Expected :?<B></B><B></B>?
        Actual :?<B>??????</B>

        public void testOverlapAnalyzer4() throws Exception
        {
        String s = "??????????";
        // the token stream for the string above:
        TokenStream ts = new TokenStream() {
        Iterator iter;

        {
            List lst = new ArrayList();
            Token t;
            t = new Token("??",0,2);
            lst.add(t);
            t = new Token("??",1,3);
            lst.add(t);
            t = new Token("??",2,4);
            lst.add(t);
            t = new Token("??",3,5);
            lst.add(t);
            t = new Token("??",4,6);
            lst.add(t);
            t = new Token("??",5,7);
            lst.add(t);
            t = new Token("??",6,8);
            lst.add(t);
            t = new Token("??",7,9);
            lst.add(t);
            t = new Token("??",8,10);
            lst.add(t);
            iter = lst.iterator();
        }

        public Token next() throws IOException
        {
            return iter.hasNext() ? (Token)iter.next() : null;
        }

        };

        String srchkey = "?? ??";

        QueryParser parser=new QueryParser("text",new WhitespaceAnalyzer());
        Query query = parser.parse(srchkey);

        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // Get 3 best fragments and separate with a "..."
        String result = highlighter.getBestFragments(ts, s, 3, "...");
        String expectedResult="?<B></B><B></B>?";
        assertEquals(expectedResult,result);
        }

        With an overlapping bigram analyzer, the current token's startOffset is the previous token's endOffset - 1, so TokenGroup.isDistinct(token) returns false most of the time, which leads to a bad tokenText range.
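        That collapse can be reproduced with the offset heuristic alone. In the self-contained sketch below (an illustration, not the actual TokenGroup code), a token only starts a new group when its start offset reaches the current group's end offset, so chained bigrams fold into a single group:

```java
public class BigramOverlapDemo {
    // Offset-based grouping as discussed in this issue: a token is "distinct"
    // (starts a new group) only when its startOffset >= the group's endOffset.
    static int countGroups(int[][] offsets) {
        int groups = 0;
        int groupEnd = -1;
        for (int[] o : offsets) {
            if (o[0] >= groupEnd) {
                groups++;                        // distinct token: new group
            }
            groupEnd = Math.max(groupEnd, o[1]); // extend the current group
        }
        return groups;
    }

    public static void main(String[] args) {
        // Bigrams over a 10-char string: (0,2), (1,3), ..., (8,10).
        // Each start offset is the previous end offset minus 1, so no token
        // after the first is ever distinct.
        int[][] bigrams = new int[9][2];
        for (int i = 0; i < 9; i++) {
            bigrams[i][0] = i;
            bigrams[i][1] = i + 2;
        }
        System.out.println(countGroups(bigrams)); // prints 1: one group spans everything
    }
}
```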

        Here is a patch that makes the tests work.

        Mark Harwood added a comment -

        Hi Kerang, can you supply a patch/tests against the latest version?

        I committed a change to highlighter on the 16th August with what I believed to be a fix for this problem: http://issues.apache.org/jira/browse/LUCENE-645

        Looking at the patch you provided it seems to be old code which is missing support for this fix.

        Cheers
        Mark

        Kerang Lv added a comment -

        Hi Mark, sorry for the long time missing!

        Here is the test, it fails again with the latest version (Revision 450719):

        Expected :A<B>BC</B>DE<B>FG</B>HIJ
        Actual:A<B>BCDEFG</B>HIJ

        public void testOverlapAnalyzer4() throws Exception
        {
        String s = "ABCDEFGHIJ";
        // the token stream for the string above:
        TokenStream ts = new TokenStream() {
        Iterator iter;

        {
            List lst = new ArrayList();
            Token t;
            t = new Token("AB",0,2);
            lst.add(t);
            t = new Token("BC",1,3);
            lst.add(t);
            t = new Token("CD",2,4);
            lst.add(t);
            t = new Token("DE",3,5);
            lst.add(t);
            t = new Token("EF",4,6);
            lst.add(t);
            t = new Token("FG",5,7);
            lst.add(t);
            t = new Token("GH",6,8);
            lst.add(t);
            t = new Token("HI",7,9);
            lst.add(t);
            t = new Token("IJ",8,10);
            lst.add(t);
            iter = lst.iterator();
        }

        public Token next() throws IOException
        {
            return iter.hasNext() ? (Token)iter.next() : null;
        }


        };

        String srchkey = "BC FG";

        QueryParser parser=new QueryParser("text",new WhitespaceAnalyzer());
        Query query = parser.parse(srchkey);

        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // Get 3 best fragments and separate with a "..."
        String result = highlighter.getBestFragments(ts, s, 3, "...");
        String expectedResult="A<B>BC</B>DE<B>FG</B>HIJ";
        assertEquals(expectedResult,result);
        }

        Mark Harwood added a comment -

        Thanks for the test Kerang.

        I no longer have a clear view as to what is expected behaviour here and whether this is a test that needs to pass.

        It seems to conflict with the expected results for Yonik's test method "testOverlapAnalyzer2".
        In that test (like yours), a cluster of overlapping tokens has search terms identified at its beginning and end. Yonik expects the whole cluster, from the first search term's start offset to the second's end offset, to be surrounded by a single highlight tag. Your test expects two tags.

        Who is right?

        This is a snippet from Yonik's test:
        query = new QueryParser("text",new WhitespaceAnalyzer()).parse("hi speed");
        highlighter = new Highlighter(new QueryScorer(query));
        result = highlighter.getBestFragments(getTS2(), s, 3, "...");
        assertEquals("<B>Hi-Speed</B>10 foo",result);

        and yours:

        String srchkey = "BC FG";
        String expectedResult="A<B>BC</B>DE<B>FG</B>HIJ";

        I don't really have an opinion either way, so I'll turn it over to you.

        Cheers
        Mark

        Hide
        Yonik Seeley added a comment -

        I agree with Kerang about the expected behavior (of this specific case at least).
        The test case of mine quoted above was not what I was shooting for, but was an acceptable unintended side-effect of fixing the other cases.

        So I'm fine with this case being changed to
        query = new QueryParser("text",new WhitespaceAnalyzer()).parse("hi speed");
        highlighter = new Highlighter(new QueryScorer(query));
        result = highlighter.getBestFragments(getTS2(), s, 3, "...");
        assertEquals("<B>Hi</B>-<B>Speed</B>10 foo",result);
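
        For illustration, the per-term tagging agreed on here can be sketched outside the Lucene Highlighter API as a toy routine that tags each matched offset range independently instead of merging a run of overlapping tokens into one tag. The class and method names below are hypothetical and not part of Lucene:

        ```java
        // Hypothetical sketch: wraps each matched [start, end) offset range
        // in its own <B>...</B> tag, rather than merging the whole cluster.
        public class SeparateTagHighlighter {
            // offsets: {start, end} pairs, sorted by start, non-overlapping
            static String highlight(String text, int[][] offsets) {
                StringBuilder sb = new StringBuilder();
                int pos = 0;
                for (int[] span : offsets) {
                    sb.append(text, pos, span[0]);  // untagged text before the match
                    sb.append("<B>").append(text, span[0], span[1]).append("</B>");
                    pos = span[1];
                }
                sb.append(text.substring(pos));     // untagged tail
                return sb.toString();
            }

            public static void main(String[] args) {
                // "BC" at [1,3) and "FG" at [5,7) in "ABCDEFGHIJ"
                System.out.println(highlight("ABCDEFGHIJ", new int[][]{{1, 3}, {5, 7}}));
                // prints: A<B>BC</B>DE<B>FG</B>HIJ
            }
        }
        ```

        Applied to the "Hi-Speed10 foo" case, the same routine yields the corrected expectation above: the hyphen between the two matched terms stays outside the tags.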

        Hide
        Chris Harris added a comment -

        I'm here two years after the last comment, trying to create a sense of closure for myself regarding the above conversation:

        • It appears that Mark Harwood committed a slightly modified version of Yonik's 2006-07-14 01:41 PM patch in r422031 and r422302.
        • It's not clear to me whether any actual code or test-suite changes eventually resulted from Mark, Yonik, and Kerang's subsequent discussion about what, exactly, should be considered correct highlighting of overlapping tokens.

          People

          • Assignee:
            Unassigned
            Reporter:
            Yonik Seeley
          • Votes:
            0
            Watchers:
            1
