Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I have a problem when using n-gram and highlighter. I thought it had been solved in LUCENE-627...

      Actually, I found this problem when I was using CJKTokenizer on Solr; here is a Lucene program that reproduces it using NGramTokenizer(min=2,max=2) instead of CJKTokenizer:

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.ngram.NGramTokenizer;
      import org.apache.lucene.queryParser.QueryParser;
      import org.apache.lucene.search.Query;
      import org.apache.lucene.search.highlight.Highlighter;
      import org.apache.lucene.search.highlight.QueryScorer;
      import java.io.Reader;
      
      public class TestNGramHighlighter {
      
        public static void main(String[] args) throws Exception {
          Analyzer analyzer = new NGramAnalyzer();
          final String TEXT = "Lucene can make index. Then Lucene can search.";
          final String QUERY = "can";
          QueryParser parser = new QueryParser("f", analyzer);
          Query query = parser.parse(QUERY);
          QueryScorer scorer = new QueryScorer(query, "f");
          Highlighter h = new Highlighter(scorer);
          System.out.println(h.getBestFragment(analyzer, "f", TEXT));
        }
      
        static class NGramAnalyzer extends Analyzer {
          public TokenStream tokenStream(String field, Reader input) {
            return new NGramTokenizer(input, 2, 2);
          }
        }
      }
      

      The expected output is:
      Lucene <B>can</B> make index. Then Lucene <B>can</B> search.

      but the actual output is:
      Lucene <B>can make index. Then Lucene can</B> search.

      1. LUCENE-1489.patch
        3 kB
        David Bowen
      2. lucene1489.patch
        3 kB
        David Bowen

        Activity

        Koji Sekiguchi created issue -
        Chris Harris added a comment -

        As I mentioned on the Solr list, I've discovered similar problems when highlighting with the ShingleFilter. (ShingleFilter does n-gram processing on Tokens, whereas NGramAnalyzer does n-gram processing on characters.) Here's a variation on Koji's demo program that exhibits some problems with ShingleFilter, as well as offering a slightly more textured example of how things work with NGramAnalyzer:

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.queryParser.QueryParser;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.highlight.QueryScorer;
        import org.apache.lucene.search.highlight.NullFragmenter;
        import org.apache.lucene.search.highlight.Highlighter;
        import org.apache.lucene.analysis.ngram.NGramTokenizer;
        import org.apache.lucene.analysis.shingle.ShingleFilter;
        import org.apache.lucene.analysis.WhitespaceTokenizer;
        import java.io.Reader;
        
        public class Main {
        
            public static void main(String[] args) throws Exception {
                testAnalyzer(new BigramShingleAnalyzer(true), "Bigram shingle analyzer (bigrams and unigrams)");
                testAnalyzer(new NGramAnalyzer(), "Bigram (non-shingle) analyzer (bigrams only)");
            }
        
            static void testAnalyzer(Analyzer analyzer, String analyzerDescription) throws Exception {
                System.out.println("Testing analyzer " + analyzerDescription + "...");
                System.out.println("---------------------------------");
                test(analyzer, "Lucene can index and can search", "Lucene");
                test(analyzer, "Lucene can make an index", "can");
                test(analyzer, "Lucene can index and can search", "can");
                test(analyzer, "Lucene can index can search and can highlight", "can");
                test(analyzer, "Lucene can index can search and can highlight", "+index +search");
                System.out.println();
            }
        
            static void test(Analyzer analyzer, String text, String queryStr) throws Exception {
                QueryParser parser = new QueryParser("f", analyzer);
                Query query = parser.parse(queryStr);
                QueryScorer scorer = new QueryScorer(query, "f");
                Highlighter h = new Highlighter(scorer);
                h.setTextFragmenter(new NullFragmenter()); // We're not testing fragmenter here.
                System.out.println(h.getBestFragment(analyzer, "f", text) + " [query='" + queryStr + "']");
            }
        
            static class NGramAnalyzer extends Analyzer {
                public TokenStream tokenStream(String field, Reader input) {
                    return new NGramTokenizer(input, 2, 2);
                }
            }
        
            static class BigramShingleAnalyzer extends Analyzer {
                boolean outputUnigrams;
        
                public BigramShingleAnalyzer(boolean outputUnigrams) {
                    this.outputUnigrams = outputUnigrams;
                }
        
                public TokenStream tokenStream(String field, Reader input) {
                    ShingleFilter sf = new ShingleFilter(new WhitespaceTokenizer(input));
                    sf.setOutputUnigrams(outputUnigrams);
                    return sf;
                }
            }
        }
        

        Here's the current output, with commentary:

        Testing analyzer Bigram shingle analyzer (bigrams and unigrams)...
        ---------------------------------
        // works ok:
        <B>Lucene</B> can index and can search [query='Lucene']
        // works ok:
        Lucene <B>can</B> make an index [query='can']
        // same as Koji's example:
        Lucene <B>can index and can</B> search [query='can']
        // if there are three matches, they all get bundled into a single highlight:
        Lucene <B>can index can search and can</B> highlight [query='can']
        // it doesn't have to be the same search term that matches:
        Lucene can <B>index can search</B> and can highlight [query='+index +search']
        
        Testing analyzer Bigram (non-shingle) analyzer (bigrams only)...
        ---------------------------------
        // works ok:
        <B>Lucene</B> can index and can search [query='Lucene']
        // is 'an' being treated as a match for 'can'(?):
        Lucene <B>can make an</B> index [query='can']
        // same as Koji's example:
        Lucene <B>can index and can</B> search [query='can']
        // if there are three matches, they all get bundled into a single highlight:
        Lucene <B>can index can search and can</B> highlight [query='can']
        // not sure what's happening here:
        Lucene can <B>index can search and</B> can highlight [query='+index +search']
        

        I'm interested in what others think, but to me it makes sense to classify both of these as the same issue. From a high-level perspective, the problem in each case seems to be that Highlighter.getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments) makes use of a TokenGroup abstraction that doesn't really work for the n-gram or the bigram shingle case:

        A TokenGroup is supposed to represent "one, or several overlapping tokens, along with the score(s) and the scope of the original text". (I assume TokenGroup was introduced to deal with synonym filter expansions.) Tokens are determined to overlap or not basically by seeing whether tokenB.startOffset() >= tokenA.endOffset(). (It's slightly more complex than this, but that's approximately what the test in TokenGroup.isDistinct() amounts to.) With the two analyzers under discussion, that criterion basically means that each token "overlaps" with the next.

        In Koji's bigram case, consider how "dogs" would get tokenized:

        "do" (startOffset=0, endOffset=2)
        "og" (startOffset=1, endOffset=3)
        "gs" (startOffset=2, endOffset=4)

        Or in my shingle case, consider how "I love Lucene" would get tokenized:

        "I" (startOffset=0, endOffset=1)
        "I love" (startOffset=0, endOffset=6)
        "love" (startOffset=2, endOffset=6)
        "love Lucene" (startOffset=2, endOffset=13)
        "Lucene" (startOffset=7, endOffset=13)

        In both cases, you never have a token whose startOffset is >= the preceding token's endOffset. So all these tokens are part of the same TokenGroup. That should mean these tokens all "overlap", but that would make for a rather mysterious notion of "overlapping".
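        To make the criterion concrete, here is a small stand-alone sketch (not Lucene code; Token here is a hypothetical stand-in, and countGroups is only an approximation of the TokenGroup.isDistinct() test described above) showing how every bigram of "dogs" chains into a single group:

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapDemo {

    // Hypothetical stand-in for Lucene's Token; only the offsets matter here.
    static class Token {
        final String text;
        final int start, end;
        Token(String text, int start, int end) {
            this.text = text; this.start = start; this.end = end;
        }
    }

    // Approximation of the isDistinct() criterion: a token starts a new group
    // only when its startOffset is at or past the running end of the group.
    static int countGroups(List<Token> tokens) {
        int groups = 0;
        int groupEnd = -1;
        for (Token t : tokens) {
            if (groupEnd < 0 || t.start >= groupEnd) {
                groups++;                                 // distinct: new group
                groupEnd = t.end;
            } else {
                groupEnd = Math.max(groupEnd, t.end);     // overlaps: extend group
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Bigrams of "dogs": every token starts before the previous one ends,
        // so the whole stream collapses into one group.
        List<Token> bigrams = new ArrayList<>();
        bigrams.add(new Token("do", 0, 2));
        bigrams.add(new Token("og", 1, 3));
        bigrams.add(new Token("gs", 2, 4));
        System.out.println(countGroups(bigrams)); // prints 1
    }
}
```

        The same check on a whitespace-tokenized stream would print one group per word, which is why ordinary analyzers never hit this problem.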

        Mark Harwood added a comment -

        It looks to me like this could be fixed in the "Formatter" classes when marking up the output string.

        Currently classes such as SimpleHTMLFormatter, in their "highlightTerm" method, put a tag around the whole section of text if it contains a hit, i.e.

        SimpleHTMLFormatter.java
        	public String highlightTerm(String originalText, TokenGroup tokenGroup)
        	{
        		StringBuffer returnBuffer;
        		if(tokenGroup.getTotalScore()>0)
        		{
        			returnBuffer=new StringBuffer();
        			returnBuffer.append(preTag);
        			returnBuffer.append(originalText);
        			returnBuffer.append(postTag);
        			return returnBuffer.toString();
        		}
        		return originalText;
        	}
        

        The TokenGroup object passed to this method contains all of the tokens and their scores so it should be possible to use this information to deconstruct the originalText parameter and inject markup according to which tokens in the group had a match rather than putting a tag around the whole block. Some complexity may lie in handling token streams that produce tokens that "rewind" to earlier offsets.
        SimpleHtmlFormatter suddenly seems less simple!

        TokenStreams that produce entirely overlapping streams of tokens will automatically be broken into multiple TokenGroups because TokenGroup has a maximum number of linked Tokens it will ever hold in a single group.

        I haven't got the time to fix this right now but if someone has a burning need to leap in, the above seems like what may be required.

        Cheers
        Mark

        David Bowen added a comment -

        Here's a patch to Highlighter.java that fixes the examples. The basic idea is to throw away (or ignore) overlapping tokens when they don't have a score, so that a token group doesn't get expanded beyond a sequence of tokens that should be highlighted.
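        The idea can be sketched stand-alone (this is an illustration of the approach described above, not the patch itself; Token is a hypothetical stand-in for Lucene's class):

```java
import java.util.ArrayList;
import java.util.List;

public class ScoredGrouping {

    // Hypothetical stand-in: offsets plus the score the query gave the token.
    static class Token {
        final int start, end;
        final float score;
        Token(int start, int end, float score) {
            this.start = start; this.end = end; this.score = score;
        }
    }

    // Returns [start,end) spans to highlight. Overlapping *scored* tokens
    // still merge, but an unscored token is simply dropped, so it can never
    // chain two separate matches into one oversized group.
    static List<int[]> highlightSpans(List<Token> tokens) {
        List<int[]> spans = new ArrayList<>();
        int[] current = null;
        for (Token t : tokens) {
            if (t.score <= 0) continue;                   // ignore unscored tokens
            if (current != null && t.start < current[1]) {
                current[1] = Math.max(current[1], t.end); // overlap: merge
            } else {
                current = new int[]{t.start, t.end};      // gap: new span
                spans.add(current);
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // "Lucene can index and can search", query "can": only the two "can"
        // tokens score, so two separate spans survive instead of one big one.
        List<Token> toks = new ArrayList<>();
        toks.add(new Token(0, 6, 0f));   // "Lucene"
        toks.add(new Token(7, 10, 1f));  // "can"
        toks.add(new Token(11, 16, 0f)); // "index"
        toks.add(new Token(17, 20, 0f)); // "and"
        toks.add(new Token(21, 24, 1f)); // "can"
        toks.add(new Token(25, 31, 0f)); // "search"
        System.out.println(highlightSpans(toks).size()); // prints 2
    }
}
```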

        David Bowen made changes -
        Field Original Value New Value
        Attachment lucene1489.patch [ 12421091 ]
        David Bowen added a comment -

        Mark, I tried the approach you suggested of using the Formatter interface. I found it didn't work because the Formatter did not have a way to map the tokens in the token group into the text. This could be fixed by providing a public accessor function for TokenGroup's matchStartOffset field. However, it seems convoluted to go to the trouble of constructing a TokenGroup only to have every Formatter have to take it all apart again to find the places within it that need highlighting. It seems to me that the purpose of a TokenGroup is to identify (up to) one span of characters that needs to be highlighted.

        David Bowen added a comment -

        By the way, here is the output from Chris's test program with this patch:

        Testing analyzer Bigram shingle analyzer (bigrams and unigrams)...
        ---------------------------------
        <B>Lucene</B> can index and can search [query='Lucene']
        Lucene <B>can</B> make an index [query='can']
        Lucene <B>can</B> index and <B>can</B> search [query='can']
        Lucene <B>can</B> index <B>can</B> search and <B>can</B> highlight [query='can']
        Lucene can <B>index</B> can <B>search</B> and can highlight [query='+index +search']
        
        Testing analyzer Bigram (non-shingle) analyzer (bigrams only)...
        ---------------------------------
        <B>Lucene</B> can index and can search [query='Lucene']
        Lucene <B>can</B> make an index [query='can']
        Lucene <B>can</B> index and <B>can</B> search [query='can']
        Lucene <B>can</B> index <B>can</B> search and <B>can</B> highlight [query='can']
        Lucene can <B>index</B> can <B>search</B> and can highlight [query='+index +search']
        
        David Bowen added a comment -

        Updated patch to work with tokenizer API changes.

        David Bowen made changes -
        Attachment LUCENE-1489.patch [ 12427643 ]
        Jens Muecke added a comment -

        I tried this patch. After applying it, the following testcase fails:

            [junit] Testcase: testOverlapAnalyzer2(org.apache.lucene.search.highlight.HighlighterTest):	FAILED
            [junit] null expected:<<B>Hi[-]Speed</B>10 foo> but was:<<B>Hi[</B>-<B>]Speed</B>10 foo>
            [junit] junit.framework.ComparisonFailure: null expected:<<B>Hi[-]Speed</B>10 foo> but was:<<B>Hi[</B>-<B>]Speed</B>10 foo>
            [junit] 	at org.apache.lucene.search.highlight.HighlighterTest$30.run(HighlighterTest.java:1558)
            [junit] 	at org.apache.lucene.search.highlight.SynonymTokenizer$TestHighlightRunner.start(HighlighterTest.java:1947)
            [junit] 	at org.apache.lucene.search.highlight.HighlighterTest.testOverlapAnalyzer2(HighlighterTest.java:1594)
            [junit] 	at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:212)
            [junit] 
            [junit] 
            [junit] Test org.apache.lucene.search.highlight.HighlighterTest FAILED
        
        Mark Thomas made changes -
        Workflow jira [ 12448335 ] Default workflow, editable Closed status [ 12562487 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12562487 ] jira [ 12584773 ]
        Lance Norskog added a comment -

        Is this still a problem?

        Koji Sekiguchi added a comment -

        This problem was my motivation for creating FastVectorHighlighter. Since FVH was committed, I haven't tried the combination of CJKTokenizer + Highlighter, so I don't know the latest situation.

        I'd like to close this as won't fix.

        Koji Sekiguchi made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Koji Sekiguchi
          • Votes:
            3
            Watchers:
            8

            Dates

            • Created:
              Updated:
              Resolved:
