Solr
  1. Solr
  2. SOLR-1954

Highlighter component should expose snippet character offsets and the score.

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: highlighter
    • Labels:
      None

      Description

      The Highlighter Component does not currently expose the snippet character offsets nor the score. There is a TODO in DefaultSolrHighlighter indicating the intention to add this eventually. This information is needed when doing highlighting on external content. The data is there so its pretty easy to output it in some way. The challenge is deciding on the output and its ramifications on backwards compatibility. The current highlighter component response structure doesn't lend itself to adding any new data, unfortunately. I wish the original implementer had some foresight. Unfortunately all the highlighting tests assume this structure. Here is a snippet of the current response structure in Solr's sample data searching for "sdram" for reference:

      <lst name="highlighting">
       <lst name="VS1GB400C3">
        <arr name="text">
      	<str>CORSAIR ValueSelect 1GB 184-Pin DDR &lt;em&gt;SDRAM&lt;/em&gt; Unbuffered DDR 400 (PC 3200) System Memory - Retail</str>
      
        </arr>
       </lst>
      </lst>
      

      Perhaps as a little hack, we introduce a pseudo field called text_startCharOffset which is the concatenation of the matching field and "_startCharOffset". This would be an array of ints. Likewise, there would be another array for endCharOffset and score.

      Thoughts?

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          Implementation for the default highlighter. Includes a basic test.

          Show
          David Smiley added a comment - Implementation for the default highlighter. Includes a basic test.
          Hide
          Hoss Man added a comment -

          if the structure is poor and hard to add additional metadata to which would be beneficial to new users let's change it

          As long as there is an option people can turn on to force the legacy behavior there's nothing wrong with that.

          In it's simplest form we can just add a new Highlighting Component (with a different class name) that is registered by default as the component "highlight" and document in CHANGES.txt that if people need/want the old one they should modify their solrconfig.xml to register it explicitly .

          alternately we can keep using hte existing class, and modify it so that it changes it's behavior based on some init param, ditto previous comments about default behavior and CHANGES.txt

          (back compat should be easy on upgrade, but i'd rather tell existing users "add this one line to your config if you really need the exact same response structure instead of this new better structure" then tell new and existing users "this is the really klunky hoop you have to jump through to make sense of all this hot new data we are returning")

          Show
          Hoss Man added a comment - if the structure is poor and hard to add additional metadata to which would be beneficial to new users let's change it As long as there is an option people can turn on to force the legacy behavior there's nothing wrong with that. In it's simplest form we can just add a new Highlighting Component (with a different class name) that is registered by default as the component "highlight" and document in CHANGES.txt that if people need/want the old one they should modify their solrconfig.xml to register it explicitly . alternately we can keep using hte existing class, and modify it so that it changes it's behavior based on some init param, ditto previous comments about default behavior and CHANGES.txt (back compat should be easy on upgrade, but i'd rather tell existing users "add this one line to your config if you really need the exact same response structure instead of this new better structure" then tell new and existing users "this is the really klunky hoop you have to jump through to make sense of all this hot new data we are returning")
          Hide
          David Smiley added a comment -

          I think a different component is overkill for such a small change. I agree it should be toggled with a parameter.

          Since the majority of users don't care about this extra metadata, perhaps the existing structure should be retained for when it is not asked for. Nobody would have to change, not even the tests. No Solr clients would have to care, necessarily. And when it is asked for (a rare need), the structure would then change to accommodate it. This would brake any client logic expecting to find the existing snippet where it usually is because it wouldn't be there. If this is undesirable or unacceptable, then there's the field suffix method that I describe in this issue and which is implemented in the patch. The only danger is that a client should not assume that that all listed fields are arrays of strings since some will be arrays of ints or floats. My patch includes such a modification for SolrJ.

          Show
          David Smiley added a comment - I think a different component is overkill for such a small change. I agree it should be toggled with a parameter. Since the majority of users don't care about this extra metadata, perhaps the existing structure should be retained for when it is not asked for. Nobody would have to change, not even the tests. No Solr clients would have to care, necessarily. And when it is asked for (a rare need), the structure would then change to accommodate it. This would brake any client logic expecting to find the existing snippet where it usually is because it wouldn't be there. If this is undesirable or unacceptable, then there's the field suffix method that I describe in this issue and which is implemented in the patch. The only danger is that a client should not assume that that all listed fields are arrays of strings since some will be arrays of ints or floats. My patch includes such a modification for SolrJ.
          Hide
          Erik Hatcher added a comment -

          The spellcheck component changes format if you ask for extendedResults. I find that painful. Let's not repeat that here.

          One option is to put the additional information into a parallel data structure keyed by uniqueKey, just like the current highlighting data structure. So rather than within it, beside it. Make sense? I know, I know... we've been talking about moving away from these parallel keyed data structures, but it's really not difficult for a client to deal with.

          Show
          Erik Hatcher added a comment - The spellcheck component changes format if you ask for extendedResults. I find that painful. Let's not repeat that here. One option is to put the additional information into a parallel data structure keyed by uniqueKey, just like the current highlighting data structure. So rather than within it, beside it. Make sense? I know, I know... we've been talking about moving away from these parallel keyed data structures, but it's really not difficult for a client to deal with.
          Hide
          David Smiley added a comment -

          Erik, I think you're suggesting what I suggested. So we'd have:

            <arr name="text">
          	<str>CORSAIR ValueSelect 1GB 184-Pin DDR &lt;em&gt;SDRAM&lt;/em&gt; Unbuffered DDR 400 (PC 3200) System Memory - Retail</str>
            </arr>
            <arr name="text_startPos">
                  <int>5</int>
            </arr>
          

          The leading problem is that clients can no longer expect that every child of the parent parent element here (which identifies the document matched) is necessarily an array of strings, because it could be an array of ints or floats. We're also overloading the field name namespace with suffixes thereby preventing this working if a fieldname actually ends with _startPos. This strikes me as no big deal.

          Show
          David Smiley added a comment - Erik, I think you're suggesting what I suggested. So we'd have: <arr name= "text" > <str> CORSAIR ValueSelect 1GB 184-Pin DDR &lt;em&gt;SDRAM&lt;/em&gt; Unbuffered DDR 400 (PC 3200) System Memory - Retail </str> </arr> <arr name= "text_startPos" > <int> 5 </int> </arr> The leading problem is that clients can no longer expect that every child of the parent parent element here (which identifies the document matched) is necessarily an array of strings, because it could be an array of ints or floats. We're also overloading the field name namespace with suffixes thereby preventing this working if a fieldname actually ends with _startPos. This strikes me as no big deal.
          Hide
          Erik Hatcher added a comment - - edited

          No, we're not talking about the same thing. Here's what I'm suggesting:

          {
            'responseHeader'=>{
              'status'=>0,
              'QTime'=>15},
            'response'=>{'numFound'=>3,'start'=>0,'maxScore'=>0.10558263,'docs'=>[
                {
                  'id'=>'IW-02',
                  'name'=>'iPod & iPod Mini USB 2.0 Cable',
                  'manu'=>'Belkin',
                  'weight'=>2.0,
                  'price'=>11.5,
                  'popularity'=>1,
                  'inStock'=>false,
                  'store_0_d'=>37.7752,
                  'store_1_d'=>-122.4232,
                  'store'=>'37.7752,-122.4232',
                  'manufacturedate_dt'=>'2006-02-14T23:55:59Z',
                  'cat'=>[
                    'electronics',
                    'connector'],
                  'features'=>[
                    'car power adapter for iPod, white'],
                  'score'=>0.10558263}]
            },
            'highlighting'=>{
              'IW-02'=>{
                'features'=>['car power adapter for <em>iPod</em>, white'],
                'name'=>['<em>iPod</em> & <em>iPod</em> Mini USB 2.0 Cable']}},
            'highlighting-extended-info'=>{
              'IW-02'=>{
                'text_startPos'=>[5]
            }
          }
          

          That way the highlighting section remains untouched, with extra stuff in a 'highlighting-extended-info' (let's use a shorter name though) section as a direct child of the root response, just like 'highlighting' is.

          Show
          Erik Hatcher added a comment - - edited No, we're not talking about the same thing. Here's what I'm suggesting: { 'responseHeader'=>{ 'status'=>0, 'QTime'=>15}, 'response'=>{'numFound'=>3,'start'=>0,'maxScore'=>0.10558263,'docs'=>[ { 'id'=>'IW-02', 'name'=>'iPod & iPod Mini USB 2.0 Cable', 'manu'=>'Belkin', 'weight'=>2.0, 'price'=>11.5, 'popularity'=>1, 'inStock'=> false , 'store_0_d'=>37.7752, 'store_1_d'=>-122.4232, 'store'=>'37.7752,-122.4232', 'manufacturedate_dt'=>'2006-02-14T23:55:59Z', 'cat'=>[ 'electronics', 'connector'], 'features'=>[ 'car power adapter for iPod, white'], 'score'=>0.10558263}] }, 'highlighting'=>{ 'IW-02'=>{ 'features'=>['car power adapter for <em>iPod</em>, white'], 'name'=>['<em>iPod</em> & <em>iPod</em> Mini USB 2.0 Cable']}}, 'highlighting-extended-info'=>{ 'IW-02'=>{ 'text_startPos'=>[5] } } That way the highlighting section remains untouched, with extra stuff in a 'highlighting-extended-info' (let's use a shorter name though) section as a direct child of the root response, just like 'highlighting' is.
          Hide
          David Smiley added a comment -

          This seems like a good idea to me. How about naming it "highlighting-extended" ? I don't think the length of this name should matter.

          Show
          David Smiley added a comment - This seems like a good idea to me. How about naming it "highlighting-extended" ? I don't think the length of this name should matter.
          Hide
          Erik Hatcher added a comment -

          +1 to highlighting-extended

          Show
          Erik Hatcher added a comment - +1 to highlighting-extended
          Hide
          Yonik Seeley added a comment -

          The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters?

          Walter Underwood proposed a good idea of just alternating segments of text for highlighting. That would also avoid the broken style of highlighting we have now (it's ambiguous since a real <em> could be embedded in the text.

          Rather than adding more outer containers like 'highlighting-extended-info', we could change what "highlighting" contains based on a parameter (and perhaps even change the default if this is targeted toward trunk).

          Show
          Yonik Seeley added a comment - The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters? Walter Underwood proposed a good idea of just alternating segments of text for highlighting. That would also avoid the broken style of highlighting we have now (it's ambiguous since a real <em> could be embedded in the text. Rather than adding more outer containers like 'highlighting-extended-info', we could change what "highlighting" contains based on a parameter (and perhaps even change the default if this is targeted toward trunk).
          Hide
          Hoss Man added a comment -

          That way the highlighting section remains untouched, with extra stuff in a 'highlighting-extended-info'

          that really seems painful – i think it would be a lot better to just come up with what the "new" structure should look like that's more flexible, populated it with more/less data based on what param the user asks for (ie: hl.positions=true) and then make this new structure the default for all future versions of solr. Folks who don't want the new types of metadata, and don't want to change their clients to understand the new structure can add some param to their defaults to revert the format. this is how we've dealt with several other changes in the past where we want the "default" behavior to be differnet for new users, but still support the old behavior for legacy users

          (spellcheck.extendedResults may seem painful because it changes results – but that's because it was never intended for you to toggle it on differnet requsts – it's expected that you'll set it once and forget it – the real problem is that it probably should have been made the default)

          The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters?

          1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
          2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values – my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.

          Walter Underwood proposed a good idea of just alternating segments of text for highlighting.

          I like that idea, and if structured properly it can still include the "score" for each matching chunk as metadata, but some clients are still going to prefer offset metadata – in particular the situation where i've got a 20MB text file in external storage and i want display the entire document with matches highlighted. returning alternating strings isn't going to really going to help me unless they aren't truncated - at which point you are returning the entire 20MB doc (broken up in a bunch of distinct strings) instead of just returning a bunch of numbers i can use to find the corrisponding points in my local copy of the file.

          Show
          Hoss Man added a comment - That way the highlighting section remains untouched, with extra stuff in a 'highlighting-extended-info' that really seems painful – i think it would be a lot better to just come up with what the "new" structure should look like that's more flexible, populated it with more/less data based on what param the user asks for (ie: hl.positions=true) and then make this new structure the default for all future versions of solr. Folks who don't want the new types of metadata, and don't want to change their clients to understand the new structure can add some param to their defaults to revert the format. this is how we've dealt with several other changes in the past where we want the "default" behavior to be differnet for new users, but still support the old behavior for legacy users (spellcheck.extendedResults may seem painful because it changes results – but that's because it was never intended for you to toggle it on differnet requsts – it's expected that you'll set it once and forget it – the real problem is that it probably should have been made the default) The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters? 1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character? 2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values – my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes. Walter Underwood proposed a good idea of just alternating segments of text for highlighting. I like that idea, and if structured properly it can still include the "score" for each matching chunk as metadata, but some clients are still going to prefer offset metadata – in particular the situation where i've got a 20MB text file in external storage and i want display the entire document with matches highlighted. returning alternating strings isn't going to really going to help me unless they aren't truncated - at which point you are returning the entire 20MB doc (broken up in a bunch of distinct strings) instead of just returning a bunch of numbers i can use to find the corrisponding points in my local copy of the file.
          Hide
          Robert Muir added a comment -

          1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
          2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values - my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.

          Nope, a 'character' in java is utf-16, it cannot even hold a full unicode code point.
          In other programming languages that might be solr clients, characters and strings might be utf-8, or utf-32.
          So if offsets are to be returned, its necessary to specify what 'unit' they are measured in.
          Otherwise, an offset is as useless as saying my house is '4' away from yours... 4 what?!

          Show
          Robert Muir added a comment - 1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character? 2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values - my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes. Nope, a 'character' in java is utf-16, it cannot even hold a full unicode code point. In other programming languages that might be solr clients, characters and strings might be utf-8, or utf-32. So if offsets are to be returned, its necessary to specify what 'unit' they are measured in. Otherwise, an offset is as useless as saying my house is '4' away from yours... 4 what?!
          Hide
          Hoss Man added a comment -

          1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?
          2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values - my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes.

          On IRC rmuir convinced me that the world is not this simple ... char offsets makes sense when dealing just with the java API, but once you send the data over the wire to a remote client, it might be using any arbitrary charset in memory, so the only possible unit you can use to convey position information that can be ignorant of the remote client is bytes.

          Show
          Hoss Man added a comment - 1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character? 2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently use when indexing into string values - my understanidng is that it's the same as the start/end offsets in tokens, so if they are char then it's char, if they are bytes, then it's bytes. On IRC rmuir convinced me that the world is not this simple ... char offsets makes sense when dealing just with the java API, but once you send the data over the wire to a remote client, it might be using any arbitrary charset in memory, so the only possible unit you can use to convey position information that can be ignorant of the remote client is bytes.
          Hide
          Hoss Man added a comment -

          whoops ... didn't notice rmuir had already commented.

          Show
          Hoss Man added a comment - whoops ... didn't notice rmuir had already commented.
          Hide
          Robert Muir added a comment -

          so the only possible unit you can use to convey position information that can be ignorant of the remote client is bytes.

          I think bytes are no better, for example:

          𩬅 is 6-bytes in UTF-16, 4-bytes in UTF-8, 2 chars in java.

          in other programming languages maybe its 1 char and 4-bytes, in java its 2 chars and 6 bytes.

          So this is why any offset numbers will be problematic, as they will invite bugs in client applications.

          Show
          Robert Muir added a comment - so the only possible unit you can use to convey position information that can be ignorant of the remote client is bytes. I think bytes are no better, for example: 𩬅 is 6-bytes in UTF-16, 4-bytes in UTF-8, 2 chars in java. in other programming languages maybe its 1 char and 4-bytes, in java its 2 chars and 6 bytes. So this is why any offset numbers will be problematic, as they will invite bugs in client applications.
          Hide
          David Smiley added a comment -

          Character offsets may not be perfect, but bytes are worse. And without either, there's no feature here to discuss! Hence lets keep it at UTF-16 character offset as a practical matter.

          Show
          David Smiley added a comment - Character offsets may not be perfect, but bytes are worse. And without either, there's no feature here to discuss! Hence lets keep it at UTF-16 character offset as a practical matter.
          Hide
          Robert Muir added a comment -

          And without either, there's no feature here to discuss! Hence lets keep it at UTF-16 character offset as a practical matter.

          I don't necessarily agree. I would like to see, for example, correct php or perl code that consumes these offsets. I argue anyone trying to use utf16 offsets from anything but java will likely have bugs in their client application.

          It would be good to consider alternative ways to implement this without using integer offsets (for example, what yonik proposed)

          Show
          Robert Muir added a comment - And without either, there's no feature here to discuss! Hence lets keep it at UTF-16 character offset as a practical matter. I don't necessarily agree. I would like to see, for example, correct php or perl code that consumes these offsets. I argue anyone trying to use utf16 offsets from anything but java will likely have bugs in their client application. It would be good to consider alternative ways to implement this without using integer offsets (for example, what yonik proposed)
          Hide
          Jamie Johnson added a comment - - edited

          I know this has been awhile, but I had a need for something like this and while I implemented (and added it to SOLR-1397) I figured I'd try this out instead as well. After applying the patch I got the following response

          <lst name="highlighting">
          <lst name="1">
          <arr name="subject_phonetic">
          <str><em>Test</em> subject message</str>
          </arr>
          <arr name="subject_phonetic_startPos"><int>0</int></arr>
          <arr name="subject_phonetic_endPos"><int>29</int></arr>
          </lst>
          </lst>

          seems that the startPos is always 0 and endPos is the length of the field including the highlighting start/end tags. Is this expected?

          Show
          Jamie Johnson added a comment - - edited I know this has been awhile, but I had a need for something like this and while I implemented (and added it to SOLR-1397 ) I figured I'd try this out instead as well. After applying the patch I got the following response <lst name="highlighting"> <lst name="1"> <arr name="subject_phonetic"> <str><em>Test</em> subject message</str> </arr> <arr name="subject_phonetic_startPos"><int>0</int></arr> <arr name="subject_phonetic_endPos"><int>29</int></arr> </lst> </lst> seems that the startPos is always 0 and endPos is the length of the field including the highlighting start/end tags. Is this expected?
          Hide
          David Smiley added a comment -

          Jamie; it's been a long time since I've touched this. I am not sure why you see the behavior you see; it is not expected at all.

          Show
          David Smiley added a comment - Jamie; it's been a long time since I've touched this. I am not sure why you see the behavior you see; it is not expected at all.
          Hide
          Fabian Deutsch added a comment -

          Good day, is there any progress on this or anything needed to get this patch upstream?

          Show
          Fabian Deutsch added a comment - Good day, is there any progress on this or anything needed to get this patch upstream?

            People

            • Assignee:
              Unassigned
              Reporter:
              David Smiley
            • Votes:
              4 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:

                Development