Solr / SOLR-195

Wildcard/prefix queries not highlighted

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.1.0, 1.2
    • Fix Version/s: None
    • Component/s: highlighter
    • Labels:
      None

      Issue Links

        Activity

        Koji Sekiguchi added a comment -

        This should be done at some point. Resolving it as a duplicate.

        David Smiley added a comment -

        Shouldn't this issue be closed as a duplicate, given Mark's comment? Though SOLR-825 is only for the phrase highlighter, not otherwise. That doesn't strike me as a big deal since the phrase highlighter is what everyone should use (yet strangely isn't the default).

        Mark Miller added a comment -

        This now works with SOLR-825.

        Chris Harris added a comment -

        I just rediscovered this bug for myself and was about to re-report it, but then I found this JIRA issue. Even though it's a bit redundant, I'm going to paste my bug report here, since A) I think it's a good summary of the problem, B) it has a remark for when usePhraseHighlighter=true, and C) it includes a few test cases.

        ****

        Highlighting with wildcards (whether * is in the middle of a term or at
        the end) doesn't work right now for the standard request handler.
        The high-level view of the problem is as follows:

        1. Extracting terms is central to highlighting
        2. Wildcard queries get parsed into ConstantScoreQuery objects
        3. It's not currently possible to extract terms from
        ConstantScoreQuery objects

        ****

        Wildcard queries get turned into ConstantScoreQuery objects. For non-prefix
        wildcards (e.g. "l*g"), the query parser directly returns a
        ConstantScoreQuery with filter = WildcardFilter. For prefix wildcards
        (e.g. "lon*"), the query parser returns a ConstantScorePrefixQuery,
        but it gets rewritten (by Query.rewrite(), which gets called in the
        highlighting component) into a ConstantScoreQuery with
        filter = PrefixFilter.

        If usePhraseHighlighter=false, then a key part of highlighting is
        Query.extractTerms(). However, ConstantScoreQuery.extractTerms()
        is an empty method. The source itself notes that this may not
        be good for highlighting: "OK to not add any terms when used for
        MultiSearcher, but may not be OK for highlighting."
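
To make the usePhraseHighlighter=false case concrete, here is a minimal self-contained sketch. It does not use Lucene: ToyQuery, ToyTermQuery, ToyConstantScoreQuery and highlight() are hypothetical stand-ins that only mimic the shape of Query.extractTerms(), and the whitespace tokenization is an assumption made to keep the example short.

```java
import java.util.HashSet;
import java.util.Set;

final class EmptyExtractTermsDemo {
    /** Wraps each space-separated token in <em> tags if the query names it as a term. */
    static String highlight(ToyQuery query, String text) {
        Set<String> terms = new HashSet<>();
        query.extractTerms(terms);
        StringBuilder out = new StringBuilder();
        for (String token : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(terms.contains(token) ? "<em>" + token + "</em>" : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a long day's night";
        // prints: a <em>long</em> day's night
        System.out.println(highlight(new ToyTermQuery("long"), text));
        // prints: a long day's night
        System.out.println(highlight(new ToyConstantScoreQuery(), text));
    }
}

/** Hypothetical stand-in for a query type; not the Lucene Query class. */
interface ToyQuery {
    /** Adds every term this query can name to the given set. */
    void extractTerms(Set<String> terms);
}

/** A query on one literal term: it can always report that term. */
final class ToyTermQuery implements ToyQuery {
    private final String term;

    ToyTermQuery(String term) {
        this.term = term;
    }

    @Override
    public void extractTerms(Set<String> terms) {
        terms.add(term);
    }
}

/** Mimics a filter-backed constant-score query: it names no terms at all. */
final class ToyConstantScoreQuery implements ToyQuery {
    @Override
    public void extractTerms(Set<String> terms) {
        // Intentionally empty, like the method described above.
    }
}
```

In this toy model the second query could still match the document, but the highlighter receives an empty term set and so has nothing to mark up.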

        If usePhraseHighlighter=true, then a key part of highlighting is
        WeightedSpanTermExtractor.extract(Query, Map). Now extract() has
        a number of different instanceof clauses, each with knowledge about
        how to extract terms from a particular kind of query. However, there
        is no instanceof clause that matches ConstantScoreQuery.
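
The instanceof dispatch described above can be sketched the same way. This is a toy model, not the WeightedSpanTermExtractor source: ToyTerm, ToyBoolean, ToyConstantScore and extract() are hypothetical names, and the only point is that a query type with no matching branch contributes nothing and raises no error.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class InstanceofExtractDemo {
    /** Marker type for the toy queries below; not a Lucene class. */
    interface ToyQuery {}

    static final class ToyTerm implements ToyQuery {
        final String term;

        ToyTerm(String term) {
            this.term = term;
        }
    }

    static final class ToyBoolean implements ToyQuery {
        final List<ToyQuery> clauses = new ArrayList<>();

        ToyBoolean add(ToyQuery clause) {
            clauses.add(clause);
            return this;
        }
    }

    /** Opaque wrapper: extract() below has no branch for it. */
    static final class ToyConstantScore implements ToyQuery {}

    /** Collects terms from the query types this method knows about. */
    static void extract(ToyQuery query, Set<String> terms) {
        if (query instanceof ToyTerm) {
            terms.add(((ToyTerm) query).term);
        } else if (query instanceof ToyBoolean) {
            for (ToyQuery clause : ((ToyBoolean) query).clauses) {
                extract(clause, terms);
            }
        }
        // No else branch: unrecognized query types are skipped silently.
    }

    public static void main(String[] args) {
        Set<String> terms = new LinkedHashSet<>();
        extract(new ToyBoolean().add(new ToyTerm("night")).add(new ToyConstantScore()), terms);
        System.out.println(terms); // prints [night]
    }
}
```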

        ****

        Here are four variants on testDefaultFieldHighlight() that all fail, even
        though I think they should pass. (The differences from
        testDefaultFieldHighlight are the hl.usePhraseHighlighter param and the
        use of wildcard in sumLRF.makeRequest.) When I run them, they each return
        a document, as expected, but they don't find any highlight blocks.

          public void testDefaultFieldPrefixWildcardHighlight() {
        
            // do summarization using re-analysis of the field
            HashMap<String,String> args = new HashMap<String,String>();
            args.put("hl", "true");
            args.put("df", "t_text");
            args.put("hl.fl", "");
            args.put("hl.usePhraseHighlighter", "false");
            TestHarness.LocalRequestFactory sumLRF = h.getRequestFactory(
              "standard", 0, 200, args);
            
            assertU(adoc("t_text", "a long day's night", "id", "1"));
            assertU(commit());
            assertU(optimize());
            assertQ("Basic summarization",
                    sumLRF.makeRequest("lon*"),
                    "//lst[@name='highlighting']/lst[@name='1']",
                    "//lst[@name='1']/arr[@name='t_text']/str"
                    );
        
          }
        
          public void testDefaultFieldPrefixWildcardHighlight2() {
        
            // do summarization using re-analysis of the field
            HashMap<String,String> args = new HashMap<String,String>();
            args.put("hl", "true");
            args.put("df", "t_text");
            args.put("hl.fl", "");
            args.put("hl.usePhraseHighlighter", "true");
            TestHarness.LocalRequestFactory sumLRF = h.getRequestFactory(
              "standard", 0, 200, args);
            
            assertU(adoc("t_text", "a long day's night", "id", "1"));
            assertU(commit());
            assertU(optimize());
            assertQ("Basic summarization",
                    sumLRF.makeRequest("lon*"),
                    "//lst[@name='highlighting']/lst[@name='1']",
                    "//lst[@name='1']/arr[@name='t_text']/str"
                    );
        
          }
        
          public void testDefaultFieldNonPrefixWildcardHighlight() {
        
            // do summarization using re-analysis of the field
            HashMap<String,String> args = new HashMap<String,String>();
            args.put("hl", "true");
            args.put("df", "t_text");
            args.put("hl.fl", "");
            args.put("hl.usePhraseHighlighter", "false");
            TestHarness.LocalRequestFactory sumLRF = h.getRequestFactory(
              "standard", 0, 200, args);
            
            assertU(adoc("t_text", "a long day's night", "id", "1"));
            assertU(commit());
            assertU(optimize());
            assertQ("Basic summarization",
                    sumLRF.makeRequest("l*g"),
                    "//lst[@name='highlighting']/lst[@name='1']",
                    "//lst[@name='1']/arr[@name='t_text']/str"
                    );
        
          }
        
          public void testDefaultFieldNonPrefixWildcardHighlight2() {
        
            // do summarization using re-analysis of the field
            HashMap<String,String> args = new HashMap<String,String>();
            args.put("hl", "true");
            args.put("df", "t_text");
            args.put("hl.fl", "");
            args.put("hl.usePhraseHighlighter", "true");
            TestHarness.LocalRequestFactory sumLRF = h.getRequestFactory(
              "standard", 0, 200, args);
            
            assertU(adoc("t_text", "a long day's night", "id", "1"));
            assertU(commit());
            assertU(optimize());
            assertQ("Basic summarization",
                    sumLRF.makeRequest("l*g"),
                    "//lst[@name='highlighting']/lst[@name='1']",
                    "//lst[@name='1']/arr[@name='t_text']/str"
                    );
        
          }
        
        Hoss Man added a comment -

        it would be hacky ... but it would be workable.

        J.J. Larrea added a comment -

        Until such time as someone implements one of the approaches for extractTerms() in the ConstantScoreXXXQuery classes in Lucene, would a workable workaround (at least for StandardRequestHandler, DisMax might be trickier) be to have the RH parse the query twice, once with the ConstantScore optimizations enabled as usual for generating the hits, and (with a trivial change to SolrQueryParser etc.) once with them disabled for highlighting? The BooleanQuery clause limit is probably more acceptable for highlighting than for generating hits, the PrefixFilter speed improvements would still be in effect generating the hits, and the query would not need to be externally munged. Or is that too hacky?
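
A rough sketch of the highlighting-side half of that idea, under stated assumptions: it is plain Java rather than Solr code, HighlightParseDemo and expandForHighlighting are hypothetical names, the sorted set stands in for the index's term dictionary, and MAX_CLAUSES is an artificially small stand-in for the BooleanQuery clause limit. The hit-generating parse is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class HighlightParseDemo {
    /** Hypothetical, deliberately tiny clause limit for the expanded query. */
    static final int MAX_CLAUSES = 3;

    /**
     * Expands a prefix query such as lon* into the indexed terms that start
     * with the prefix, so a highlighter would have concrete terms to use.
     */
    static List<String> expandForHighlighting(String prefixQuery, SortedSet<String> indexedTerms) {
        if (!prefixQuery.endsWith("*")) {
            throw new IllegalArgumentException("expected a prefix query such as lon*");
        }
        String prefix = prefixQuery.substring(0, prefixQuery.length() - 1);
        List<String> expanded = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matches are contiguous.
        for (String term : indexedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: we are past the prefix range
            }
            if (expanded.size() == MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses for prefix " + prefix);
            }
            expanded.add(term);
        }
        return expanded;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("a", "day's", "lone", "long", "night"));
        System.out.println(expandForHighlighting("lon*", terms)); // prints [lone, long]
    }
}
```

The exception path is the trade-off mentioned above: a very broad prefix can exceed the clause limit, which may be tolerable when it only disables highlighting rather than the search itself.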

        Xuesong Luo added a comment -

        (auto auto?*) can be used as a workaround for auto* if you do want highlighting.

        Hoss Man added a comment -

        A follow-up note: as mentioned in the email thread linked to in the issue report, one workaround people may want to consider if highlighting is important (at the expense of the PrefixFilter optimization) is to force the use of a WildcardQuery in what would otherwise be interpreted as a PrefixQuery by putting a "?" before the "*".

        i.e., auto?* instead of auto*

        (yes, this does require that at least one character follow the prefix)
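
The auto?* trick can be applied mechanically before a query string is sent to Solr. The sketch below is one hypothetical client-side way to do that; PrefixWorkaround, forceWildcard and isPlainPrefix are invented names, not Solr or Lucene API, and the whitespace tokenization is deliberately naive (it does not understand quoting, escaping, or grouping).

```java
import java.util.StringJoiner;

final class PrefixWorkaround {
    /** Rewrites every plain prefix token (auto*) to the wildcard form (auto?*). */
    static String forceWildcard(String query) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : query.trim().split("\\s+")) {
            if (isPlainPrefix(token)) {
                out.add(token.substring(0, token.length() - 1) + "?*");
            } else {
                out.add(token);
            }
        }
        return out.toString();
    }

    /** True for tokens whose only wildcard is a single trailing '*'. */
    static boolean isPlainPrefix(String token) {
        if (token.length() < 2 || !token.endsWith("*")) {
            return false;
        }
        String stem = token.substring(0, token.length() - 1);
        return stem.indexOf('*') < 0 && stem.indexOf('?') < 0;
    }

    public static void main(String[] args) {
        System.out.println(forceWildcard("auto*"));          // prints auto?*
        System.out.println(forceWildcard("red auto* l*g"));  // prints red auto?* l*g
    }
}
```

Tokens that already contain a "?" or an inner "*" are left alone, since they would not be parsed as prefix queries in the first place.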

        Yonik Seeley added a comment -

        I'm not sure rewrite ever really added much to Lucene, though...
        ConstantScoreRangeQuery and ConstantScorePrefixQuery could just as easily reuse a common scorer rather than rewrite to ConstantScoreQuery. Seems like we should fix it at the Lucene level though.

        Hoss Man added a comment -

        Hmmm... perhaps we should consider the larger Lucene issue of ConstantScoreQuery, extractTerms, and highlighting ... maybe CSQ should have a method for specifying a callback to use if/when extractTerms is called?
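
One way to picture the callback idea, as a sketch only: the class below is a toy, not a proposed Lucene patch, and ToyConstantScoreQuery plus its termCallback parameter are hypothetical names. The creator of the query supplies a function that knows which terms to report.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

final class ExtractTermsCallbackDemo {
    /** Toy constant-score query with an optional term-extraction callback. */
    static final class ToyConstantScoreQuery {
        private final Consumer<Set<String>> termCallback;

        ToyConstantScoreQuery(Consumer<Set<String>> termCallback) {
            this.termCallback = termCallback;
        }

        /** Delegates to the callback if one was supplied; otherwise adds nothing. */
        void extractTerms(Set<String> terms) {
            if (termCallback != null) {
                termCallback.accept(terms);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> withoutCallback = new TreeSet<>();
        new ToyConstantScoreQuery(null).extractTerms(withoutCallback);
        System.out.println(withoutCallback); // prints []

        Set<String> withCallback = new TreeSet<>();
        new ToyConstantScoreQuery(terms -> terms.add("long")).extractTerms(withCallback);
        System.out.println(withCallback); // prints [long]
    }
}
```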

        Yonik Seeley added a comment -

        Problem is, rewrite() changes ConstantScorePrefixQuery to ConstantScoreQuery.
        Perhaps we could change ConstantScorePrefixQuery.rewrite() to a no-op, then implement extractTerms().

        Hoss Man added a comment -

        I bet this could be fixed by adding an extractTerms() method to ConstantScorePrefixQuery.


          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Mike Klaas
          • Votes:
            3
            Watchers:
            4

            Dates

            • Created:
              Updated:
              Resolved:

              Development