Solr
  1. Solr
  2. SOLR-24

Add Highlighting to standard request handler

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: search
    • Labels:
      None

      Description

      This patch adds highlighting functionality to solr request handlers it also refactors StandardRequestHandler to use the common functionality provided in SolrPluginUtils. I'd have preferred to do two separate patches, but creating two mutually-dependent patches on a repo without being able to commit a revision was daunting.

      -----------------------------------
      Refactoring StandardRequestHandler:

      1. Moved solr.util.CommonParams to its own class. Removed DisMax-specific parameters, and placed in a subclass.
      2. StandardRequestHandler uses CommonParams to store config-time parameter values (new feature)
      3. StandardRequestHandler uses SolrPluginUtils methods for duplicate functionality
      4. Some of said SPU methods have grown a "params" parameter to enable them to use default values. (Note: instead of passing this around, something like a RequestHelper class which carries the SolrRequest and Param values would be useful. This class could house the utility methods that require Request parameters).
      5. SolrPluginUtils.getParam() only uses the default parameter if it is null, not blank.

      --------------------------------------
      Highlighting:

      1. Highlighting is controlled by three request parameters:
      highlight: list of fields to highlight, or highlight the default field if at all present
      maxSnippets: maximum number of snippets to return for each field
      highlightFormatterClass: 'solr.<classname>' or full package path of highlight.Formatter subclass to use in highlighting.
      2. Default formatter is to use <em> tags. There are issues with this approach, but are mitigated with the ability to specify a custom Formatter. Definately should consider alternatives (a custom xml approach to denote highlit regions will require some Highlighter package hackery).
      3. Document summaries are returned as a separate element under <response> format is still up for discussion.

      1. highlight_patch_v1.diff
        40 kB
        Mike Klaas
      2. highlight_patch_v2.diff
        51 kB
        Mike Klaas
      3. highlight_patch_v3.diff
        51 kB
        Mike Klaas
      4. highlight_patch_v4.diff
        48 kB
        Mike Klaas
      5. highlight_v5.patch
        53 kB
        Mike Klaas

        Activity

        Hide
        Bill Au added a comment -

        I have come across a test case there the patch doesn't work. I am using the data in example/exampledocs/ipod_other.xml.
        After adding the two docs I queried solr using curl:

        curl -i "http://localhost:8983/solr/select/?q=ipod&highlight=name&indent=true"

        I got the following highlighting data in the response:

        <lst name="highlighting">
        <lst name="id=IW-02">
        <arr name="name">
        <str>i<em>iPod</em> & i<em>iPod</em> Mini USB 22.0 Cable</str>
        </arr>
        </lst>
        <lst name="id=F8V7067-APL-KIT">
        <arr name="name">
        <str>Belkin Mobile Power Cord for i<em>iPod</em> w/ Dock</str>
        </arr>
        </lst>
        </lst>

        In both cases, there is an extra i in front of <em>iPod</em>.
        My guess is there is something wrong with the TokenStream passed to the Highligher. I am taking a closing look at that now.

        Show
        Bill Au added a comment - I have come across a test case there the patch doesn't work. I am using the data in example/exampledocs/ipod_other.xml. After adding the two docs I queried solr using curl: curl -i "http://localhost:8983/solr/select/?q=ipod&highlight=name&indent=true" I got the following highlighting data in the response: <lst name="highlighting"> <lst name="id=IW-02"> <arr name="name"> <str>i<em>iPod</em> & i<em>iPod</em> Mini USB 22.0 Cable</str> </arr> </lst> <lst name="id=F8V7067-APL-KIT"> <arr name="name"> <str>Belkin Mobile Power Cord for i<em>iPod</em> w/ Dock</str> </arr> </lst> </lst> In both cases, there is an extra i in front of <em>iPod</em>. My guess is there is something wrong with the TokenStream passed to the Highligher. I am taking a closing look at that now.
        Hide
        Hoss Man added a comment -

        Bill: I beleive what you are seeing is an issue with the Highlighter not playing nice with Analyzers that produce multiple tokens in the same position – Mike mentioend this in the email discussion, but i assumed it was reagarding using TermPosition information ... in the example app "name" is a stored field, so i'm not sure why the indexing Analyzer would affect highlighting (perhaps it's yet anohter aspect of the highlighter package i don't understand)

        Incidently, We might as well capture here a summary of the dicussion from last week...
        http://www.nabble.com/-jira--Created%3A-%28SOLR-24%29-Add-Highlighting-to-standard-request-handler-tf1834206.html

        1) patch generally look good
        2) I'm hesitent to change SolrPluginUtils.getParam()

        In this particular case (using the "highlight" param to indicate both that highlighting should be turned on, and to override the default highlighting fields) I think a better API would be two seperate params, and the more i think about it the more i like that solution in general – much as Mike pointed out that "Significant whitespace scares me" in response to truely ridiculous suggestion i made, significant empty space scares me ... so i would vote for changing that and then commiting.

        Incidently, i got a response from Erik H giving this patch his thumbs up a few days ago, but looking at the thread now i realize it came staight to me and not the list.

        Show
        Hoss Man added a comment - Bill: I beleive what you are seeing is an issue with the Highlighter not playing nice with Analyzers that produce multiple tokens in the same position – Mike mentioend this in the email discussion, but i assumed it was reagarding using TermPosition information ... in the example app "name" is a stored field, so i'm not sure why the indexing Analyzer would affect highlighting (perhaps it's yet anohter aspect of the highlighter package i don't understand) Incidently, We might as well capture here a summary of the dicussion from last week... http://www.nabble.com/-jira--Created%3A-%28SOLR-24%29-Add-Highlighting-to-standard-request-handler-tf1834206.html 1) patch generally look good 2) I'm hesitent to change SolrPluginUtils.getParam() In this particular case (using the "highlight" param to indicate both that highlighting should be turned on, and to override the default highlighting fields) I think a better API would be two seperate params, and the more i think about it the more i like that solution in general – much as Mike pointed out that "Significant whitespace scares me" in response to truely ridiculous suggestion i made, significant empty space scares me ... so i would vote for changing that and then commiting. Incidently, i got a response from Erik H giving this patch his thumbs up a few days ago, but looking at the thread now i realize it came staight to me and not the list.
        Hide
        Mike Klaas added a comment -

        Just to follow up, I'll be creating an new patch which implements the changes in highlighting parameters (this should be ready in a few days).

        The issue with the doubled tokens only occurs when analyzing a stored field for highlighting (it does not occur when using Term Position vectors). Stored fields need to be analyzed on the fly to do highlighting matching to the query term.

        Show
        Mike Klaas added a comment - Just to follow up, I'll be creating an new patch which implements the changes in highlighting parameters (this should be ready in a few days). The issue with the doubled tokens only occurs when analyzing a stored field for highlighting (it does not occur when using Term Position vectors). Stored fields need to be analyzed on the fly to do highlighting matching to the query term.
        Hide
        Bill Au added a comment -

        The TokenStream looks OK. It contains "i", "pod", and "ipod".

        I also tried using different query terms.
        q=pod produces the same output as q=ipod.

        q=iPod produces output with the extra "i" highlighted as well, ie

        <lst name="highlighting">
        <lst name="id=IW-02">
        <arr name="name">
        <str><em>i</em><em>iPod</em> & <em>i</em><em>iPod</em> Mini USB 22.0 Cable</str>
        </arr>
        </lst>
        <lst name="id=F8V7067-APL-KIT">
        <arr name="name">
        <str>Belkin Mobile Power Cord for <em>i</em><em>iPod</em> w/ Dock</str>
        </arr>
        </lst>
        </lst>

        Show
        Bill Au added a comment - The TokenStream looks OK. It contains "i", "pod", and "ipod". I also tried using different query terms. q=pod produces the same output as q=ipod. q=iPod produces output with the extra "i" highlighted as well, ie <lst name="highlighting"> <lst name="id=IW-02"> <arr name="name"> <str><em>i</em><em>iPod</em> & <em>i</em><em>iPod</em> Mini USB 22.0 Cable</str> </arr> </lst> <lst name="id=F8V7067-APL-KIT"> <arr name="name"> <str>Belkin Mobile Power Cord for <em>i</em><em>iPod</em> w/ Dock</str> </arr> </lst> </lst>
        Hide
        Mike Klaas added a comment -

        Updated highlighting patch.

        Numerous improvements:

        • uses three parameters: highlight (true/false) highlightFields maxSnippets
        • vastly improved defaultField handling (old version ignored some cases of specified default fields, and did the wrong thing entirely for DisMax)
        • vastly improved DisMax query handling

        Open issues:

        • I needed to subclass a final class in the highlighter contrib package in lucene, which required C&P. This is a stop-gap measure until the issue is fixed in lucene.
        • The interaction between WordDelimiterFilter and Highlighter is not fixed. This will require further investigation.

        I believe the patch is in a commitable state. If it is committed, I'll open JIRA issues for the open problems above.

        Show
        Mike Klaas added a comment - Updated highlighting patch. Numerous improvements: uses three parameters: highlight (true/false) highlightFields maxSnippets vastly improved defaultField handling (old version ignored some cases of specified default fields, and did the wrong thing entirely for DisMax) vastly improved DisMax query handling Open issues: I needed to subclass a final class in the highlighter contrib package in lucene, which required C&P. This is a stop-gap measure until the issue is fixed in lucene. The interaction between WordDelimiterFilter and Highlighter is not fixed. This will require further investigation. I believe the patch is in a commitable state. If it is committed, I'll open JIRA issues for the open problems above.
        Hide
        Mike Klaas added a comment -

        The previous patch didn't apply cleanly to trunk due to svn keyword. This version omits a docstring change.

        Show
        Mike Klaas added a comment - The previous patch didn't apply cleanly to trunk due to svn keyword. This version omits a docstring change.
        Hide
        Bill Au added a comment -

        My guess is that the Highlighter doesn't like the fact that the
        TokenStream produced by the WordDelimiterFilter contains
        multiple Tokens with the same starting offset.

        Show
        Bill Au added a comment - My guess is that the Highlighter doesn't like the fact that the TokenStream produced by the WordDelimiterFilter contains multiple Tokens with the same starting offset.
        Hide
        Mike Klaas added a comment -

        Updated patch reflecting comments:

        • splitList typo fixed
        • fields producing no highlight snippets are not included in result tag
        • full support for multi-valued fields, this includes a mutl-value token stream and TextFragmenter which starts fragments on value boundaries
        • use of SolrQueryTermExtractor removed (not needed with current version of lucene)
        • "id=key" attributes are stripped if the schema contains no uniqueKey
        • Factored out response field setting from StandardRequestHandler (note, this is a slight semantic change, as previously the score was only included if "score" was sent in the field list. The current behavious matched DisMaxRequestHandler, but not the documentation. If this behaviour is desirable, both handlers should be modified to comply).
        Show
        Mike Klaas added a comment - Updated patch reflecting comments: splitList typo fixed fields producing no highlight snippets are not included in result tag full support for multi-valued fields, this includes a mutl-value token stream and TextFragmenter which starts fragments on value boundaries use of SolrQueryTermExtractor removed (not needed with current version of lucene) "id=key" attributes are stripped if the schema contains no uniqueKey Factored out response field setting from StandardRequestHandler (note, this is a slight semantic change, as previously the score was only included if "score" was sent in the field list. The current behavious matched DisMaxRequestHandler, but not the documentation. If this behaviour is desirable, both handlers should be modified to comply).
        Hide
        Yonik Seeley added a comment -

        I just got a chance to check this out... it's looking good Mike! I think this is something that will be useful to a lot of people.

        Perhaps highlight=true could be overloaded later to provide alternate ways of highlighting (giving positions rather than inserting <em> in the values). While <em> is a little flawed since it's problematic in field values that may already contain markup, I think it's probably the best/simplest default that people can use easily.

        Regarding the <lst name="id=theuniqueid">... should this be <lst name="theuinqueid"> without including the field name?
        Including the unique key field name (which I assume id is), does add a little more info, at the cost of making it harder for clients to use the value.

        I like that you leave in empty entries like <lst name="id=VS1GB400C3"/> for documents that there are no highlighter fragments for. This makes it possible to access the highlight info by position rather than relying on id correlation.

        I wasn't able to get multi-valued fields working though. Even if multiple values matched, I would only get a single <str> back in the highlight section.
        Example with the demo document set:
        http://localhost:8983/solr/select?q=features:(USB+video+color+includes+MB+memory+display+photo)&indent=on&highlight=true&highlightFields=features

        <lst name="highlighting">
        <lst name="id=MA147LL/A">
        <arr name="features">
        <str>Stores up to 1515,000 songs, 2525,000 <em>photos</em>, or 150 hours of <em>video</em></str>
        </arr>
        </lst>

        Show
        Yonik Seeley added a comment - I just got a chance to check this out... it's looking good Mike! I think this is something that will be useful to a lot of people. Perhaps highlight=true could be overloaded later to provide alternate ways of highlighting (giving positions rather than inserting <em> in the values). While <em> is a little flawed since it's problematic in field values that may already contain markup, I think it's probably the best/simplest default that people can use easily. Regarding the <lst name="id=theuniqueid">... should this be <lst name="theuinqueid"> without including the field name? Including the unique key field name (which I assume id is), does add a little more info, at the cost of making it harder for clients to use the value. I like that you leave in empty entries like <lst name="id=VS1GB400C3"/> for documents that there are no highlighter fragments for. This makes it possible to access the highlight info by position rather than relying on id correlation. I wasn't able to get multi-valued fields working though. Even if multiple values matched, I would only get a single <str> back in the highlight section. Example with the demo document set: http://localhost:8983/solr/select?q=features:(USB+video+color+includes+MB+memory+display+photo)&indent=on&highlight=true&highlightFields=features <lst name="highlighting"> <lst name="id=MA147LL/A"> <arr name="features"> <str>Stores up to 1515,000 songs, 2525,000 <em>photos</em>, or 150 hours of <em>video</em></str> </arr> </lst>
        Hide
        Yonik Seeley added a comment -

        > I wasn't able to get multi-valued fields working though.

        As I started browsing the code I ran across the maxSnippets param and cranked it up... now all the snippets appear as a single string rather than multiple strings in the array. Am I doing something wrong?

        Example:
        <lst name="id=MA147LL/A">
        <arr name="features">
        <str>
        iiTunes, Podcasts, AudiobooksStores up to 1515,000 songs, 2525,000 <em>photos</em>, or 150 hours of <em>video</em>22.5-inch, 320x240 <em>color</em> TFT LCD <em>display</em> with LED backlightUp to 20 hours of battery lifePlays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 <em>video</em>Notes, Calendar, Phone book, Hold button, Date <em>display</em>, <em>Photo</em> wallet, BuiltBuilt-in games, JPEG <em>photo</em> playback, Upgradeable firmware, <em>USB</em> 22.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication
        </str>
        </arr>
        </lst>

        Show
        Yonik Seeley added a comment - > I wasn't able to get multi-valued fields working though. As I started browsing the code I ran across the maxSnippets param and cranked it up... now all the snippets appear as a single string rather than multiple strings in the array. Am I doing something wrong? Example: <lst name="id=MA147LL/A"> <arr name="features"> <str> iiTunes, Podcasts, AudiobooksStores up to 1515,000 songs, 2525,000 <em>photos</em>, or 150 hours of <em>video</em>22.5-inch, 320x240 <em>color</em> TFT LCD <em>display</em> with LED backlightUp to 20 hours of battery lifePlays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 <em>video</em>Notes, Calendar, Phone book, Hold button, Date <em>display</em>, <em>Photo</em> wallet, BuiltBuilt-in games, JPEG <em>photo</em> playback, Upgradeable firmware, <em>USB</em> 22.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication </str> </arr> </lst>
        Hide
        Mike Klaas added a comment -

        Interesting... Highlighter by default merges contiguous fragments (and fragments can be contiguous regardless of the gap between them).

        I made a change to disable fragment merging and the bahaviour is more natural. I'll post a modified patch once we determine what othe rmodifications are necessary.

        Show
        Mike Klaas added a comment - Interesting... Highlighter by default merges contiguous fragments (and fragments can be contiguous regardless of the gap between them). I made a change to disable fragment merging and the bahaviour is more natural. I'll post a modified patch once we determine what othe rmodifications are necessary.
        Hide
        Mike Klaas added a comment -

        I also agree about removing the "id=" (I was just blithly following the example of the debug data here).

        Another note to reviewers: in the process of Gap-enabled the text fragmenter, I introduced a bug that caused over-long fragments in regular fields. This has been fix on my local copy.

        Show
        Mike Klaas added a comment - I also agree about removing the "id=" (I was just blithly following the example of the debug data here). Another note to reviewers: in the process of Gap-enabled the text fragmenter, I introduced a bug that caused over-long fragments in regular fields. This has been fix on my local copy.
        Hide
        Yonik Seeley added a comment -

        > - Factored out response field setting from StandardRequestHandler (note, this is a slight semantic change, as previously
        > the score was only included if "score" was sent in the field list.

        This was done because there is an optimization that can use a filter (the set of all documents matching a query) to satisfy a sorted query if scores aren't needed. This bypasses re-executing the query.

        > I also agree about removing the "id=" (I was just blithly following the example of the debug data here).

        Yeah, I really only meant for the debug stuff to be human-readable (and more likely to change in the future).

        Regarding gaps... I can see how one would need to rely on a position gap when using term-vectors... but when re-analyzing stored fields, they are already discrete. Is the problem caused by the hilighter architecture (I haven't used it before)?

        Show
        Yonik Seeley added a comment - > - Factored out response field setting from StandardRequestHandler (note, this is a slight semantic change, as previously > the score was only included if "score" was sent in the field list. This was done because there is an optimization that can use a filter (the set of all documents matching a query) to satisfy a sorted query if scores aren't needed. This bypasses re-executing the query. > I also agree about removing the "id=" (I was just blithly following the example of the debug data here). Yeah, I really only meant for the debug stuff to be human-readable (and more likely to change in the future). Regarding gaps... I can see how one would need to rely on a position gap when using term-vectors... but when re-analyzing stored fields, they are already discrete. Is the problem caused by the hilighter architecture (I haven't used it before)?
        Hide
        Mike Klaas added a comment -

        > This was done because there is an optimization that can use a filter (the set of all documents matching a query) to satisfy > a sorted query if scores aren't needed. This bypasses re-executing the query.

        I'll ensure this skippage is possible with StandardrequestHandler. It should probably be added to DisMax too, in that case.

        > Regarding gaps... I can see how one would need to rely on a position gap when using term-vectors... but when > > re-analyzing stored fields, they are already discrete. Is the problem caused by the hilighter architecture (I haven't used it >before)?

        Highlighter takes a tokenstream and a piece of text and fragments it. The fragments are scored to determined which top set of them to return. The only way for it to work on multiple token streams is to invoke it multiple times (in which case we have to find a way of merging the highlighting output from each in a nice way), or fooling highlighter into thinking it is a signle stream (ensuring separation among the various parts), which is attractive since Highlighter compares the fragments from all the parts and picks the globally highest scoring fragments.

        Show
        Mike Klaas added a comment - > This was done because there is an optimization that can use a filter (the set of all documents matching a query) to satisfy > a sorted query if scores aren't needed. This bypasses re-executing the query. I'll ensure this skippage is possible with StandardrequestHandler. It should probably be added to DisMax too, in that case. > Regarding gaps... I can see how one would need to rely on a position gap when using term-vectors... but when > > re-analyzing stored fields, they are already discrete. Is the problem caused by the hilighter architecture (I haven't used it >before)? Highlighter takes a tokenstream and a piece of text and fragments it. The fragments are scored to determined which top set of them to return. The only way for it to work on multiple token streams is to invoke it multiple times (in which case we have to find a way of merging the highlighting output from each in a nice way), or fooling highlighter into thinking it is a signle stream (ensuring separation among the various parts), which is attractive since Highlighter compares the fragments from all the parts and picks the globally highest scoring fragments.
        Hide
        Mike Klaas added a comment -

        (hopefully) complete patch.

        • removed "id=" from document highlighting entries
        • ensured the multi-value snippets do not get concatenated
        • in StandardRequestHandler and DisMaxHandler, the score flag is set iff it is in the responsefields parameter
        Show
        Mike Klaas added a comment - (hopefully) complete patch. removed "id=" from document highlighting entries ensured the multi-value snippets do not get concatenated in StandardRequestHandler and DisMaxHandler, the score flag is set iff it is in the responsefields parameter
        Hide
        Yonik Seeley added a comment -

        +1 for committing this.
        Excellent job Mike!

        Show
        Yonik Seeley added a comment - +1 for committing this. Excellent job Mike!
        Hide
        Yonik Seeley added a comment -

        I just committed this, and opened a lucene bug
        http://issues.apache.org/jira/browse/LUCENE-627

        Now we just need to update the wiki!

        Show
        Yonik Seeley added a comment - I just committed this, and opened a lucene bug http://issues.apache.org/jira/browse/LUCENE-627 Now we just need to update the wiki!
        Hide
        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.1

        The Fix Version for all 38 issues found was set to 1.1, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman3

        Show
        Hoss Man added a comment - This bug was modified as part of a bulk update using the criteria... Marked ("Resolved" or "Closed") and "Fixed" Had no "Fix Version" versions Was listed in the CHANGES.txt for 1.1 The Fix Version for all 38 issues found was set to 1.1, email notification was suppressed to prevent excessive email. For a list of all the issues modified, search jira comments for this (hopefully) unique string: 20080415hossman3

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Mike Klaas
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development