Solr
  1. Solr
  2. SOLR-556

Highlighting of multi-valued fields returns snippets which span multiple different values

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.3
    • Component/s: highlighter
    • Labels:
      None
    • Environment:

      Tomcat 5.5

      Description

      When highlighting multi-valued fields, the highlighter sometimes returns snippets which span multiple values, e.g. with values "foo" and "bar" and search term "ba" the highlighter will create the snippet "foo<em>ba</em>r". Furthermore it sometimes returns smaller snippets than it should, e.g. with value "foobar" and search term "oo" it will create the snippet "<em>oo</em>" regardless of hl.fragsize.

      I have been unable to determine the real cause for this, or indeed what actually goes on at all. To reproduce the problem, I've used the following steps:

      • create an index with multi-valued fields, one document should have at least 3 values for these fields (in my case strings of length between 5 and 15 Japanese characters – as far as I can tell plain old ASCII should produce the same effect though)
      • search for part of a value in such a field with highlighting enabled, the additional parameters I use are hl.fragsize=70, hl.requireFieldMatch=true, hl.mergeContiguous=true (changing the parameters does not seem to have any effect on the result though)
      • highlighted snippets should show effects described above

        Issue Links

          Activity

          Hide
          Lars Kotthoff added a comment -

          Patch against SVN HEAD to treat multi valued fields like single valued fields when highlighting by looping over the field values and accumulating the highlighted snippets.

          This corrects the behaviour I've described and simplifies the code. The downside is that it may impose a performance penalty for large numbers of snippets. The code breaks out of the loop when enough snippets have been found without considering the other values of the fields, which means that the returned snippets may not be the best ones.

          Show
          Lars Kotthoff added a comment - Patch against SVN HEAD to treat multi valued fields like single valued fields when highlighting by looping over the field values and accumulating the highlighted snippets. This corrects the behaviour I've described and simplifies the code. The downside is that it may impose a performance penalty for large numbers of snippets. The code breaks out of the loop when enough snippets have been found without considering the other values of the fields, which means that the returned snippets may not be the best ones.
          Hide
          Lars Kotthoff added a comment -

          The previous patch caused ArrayIndexOutOfBoundsExceptions in some cases. This is corrected with this patch.

          Show
          Lars Kotthoff added a comment - The previous patch caused ArrayIndexOutOfBoundsExceptions in some cases. This is corrected with this patch.
          Hide
          Otis Gospodnetic added a comment -

          Lars - could you please try the patch in SOLR-553 and see if it fixes the problem you described here?

          Show
          Otis Gospodnetic added a comment - Lars - could you please try the patch in SOLR-553 and see if it fixes the problem you described here?
          Hide
          Lars Kotthoff added a comment -

          I've applied SOLR-553 and confirmed that this problem is not fixed, regardless of the setting of usePhraseHighlighter.

          Show
          Lars Kotthoff added a comment - I've applied SOLR-553 and confirmed that this problem is not fixed, regardless of the setting of usePhraseHighlighter.
          Hide
          Lars Kotthoff added a comment -

          Attaching test file with example document, relevant part of schema.xml and example query.

          Show
          Lars Kotthoff added a comment - Attaching test file with example document, relevant part of schema.xml and example query.
          Hide
          Mike Klaas added a comment -

          Thanks for the report, Lars. I'll take a look at this shortly.

          Show
          Mike Klaas added a comment - Thanks for the report, Lars. I'll take a look at this shortly.
          Hide
          Lars Kotthoff added a comment -

          Attaching new patch which checks the fragments returned by Lucene properly. The getBestTextFragments method sometimes returns fragments which do not contain the search term at all (with a score of 0), which I wasn't aware of. The new patch checks whether the score of a fragment is > 0 before adding it to the list of fragments. It also removes the intermediate list of TextFragments and only accumulates a list of Strings, since this is the only information that's returned anyway.

          With the new patch the unit tests (which I finally convinced to run at all in Eclipse) succeed.

          Show
          Lars Kotthoff added a comment - Attaching new patch which checks the fragments returned by Lucene properly. The getBestTextFragments method sometimes returns fragments which do not contain the search term at all (with a score of 0), which I wasn't aware of. The new patch checks whether the score of a fragment is > 0 before adding it to the list of fragments. It also removes the intermediate list of TextFragments and only accumulates a list of Strings, since this is the only information that's returned anyway. With the new patch the unit tests (which I finally convinced to run at all in Eclipse) succeed.
          Hide
          Lars Kotthoff added a comment -

          Sorry for all the noise. Attaching yet another new version of the patch which makes sure that for multi-valued field the best fragments are returned, rather than the first fragments with score > 0 as before.

          Added a unit test to verify this behaviour.

          Show
          Lars Kotthoff added a comment - Sorry for all the noise. Attaching yet another new version of the patch which makes sure that for multi-valued field the best fragments are returned, rather than the first fragments with score > 0 as before. Added a unit test to verify this behaviour.
          Hide
          Mike Klaas added a comment -

          Ah, I see what the problem is: Although it is impossible for tokens from different values to appear in the same fragment (due to the semantics of MultiValuedTokenFilter), the non-token text (typically, punctuation) from different values can bleed into the same fragment, since lucene's highlighter can only create a new fragment on token boundaries.

          Unfortunately SOLR-553 was committed a day after you submitted your patch, and rearranges the code slightly so that it no longer applies. Could you sync the patch with trunk? I think the basic approach is sound.

          Show
          Mike Klaas added a comment - Ah, I see what the problem is: Although it is impossible for tokens from different values to appear in the same fragment (due to the semantics of MultiValuedTokenFilter), the non-token text (typically, punctuation) from different values can bleed into the same fragment, since lucene's highlighter can only create a new fragment on token boundaries. Unfortunately SOLR-553 was committed a day after you submitted your patch, and rearranges the code slightly so that it no longer applies. Could you sync the patch with trunk? I think the basic approach is sound.
          Hide
          Lars Kotthoff added a comment -

          Attaching new patch which should apply cleanly to current HEAD. Also verified that it doesn't break anything fixed in SOLR-553 by running the unit test.

          Show
          Lars Kotthoff added a comment - Attaching new patch which should apply cleanly to current HEAD. Also verified that it doesn't break anything fixed in SOLR-553 by running the unit test.
          Hide
          Mike Klaas added a comment -

          Thanks for the patch, Lars. I think that the basic approach is sound, though I am a little nervous about the performance implications (especially in the case of phrase highlighting, where we spin up an entirely new spanhighlighter for each value in a multi-valued field). I wonder if I am the only one who highlights large text fields composed of dozens of individual values?

          Show
          Mike Klaas added a comment - Thanks for the patch, Lars. I think that the basic approach is sound, though I am a little nervous about the performance implications (especially in the case of phrase highlighting, where we spin up an entirely new spanhighlighter for each value in a multi-valued field). I wonder if I am the only one who highlights large text fields composed of dozens of individual values?
          Hide
          Lars Kotthoff added a comment -

          In the setup I've been testing it with (one large single-valued "text" field and several multi-valued fields) it didn't seem to have any serious performance implications – i.e. the randomness of my test queries was enough to mask any loss of performance
          The length of the multi-valued fields is in the order of 10-20 characters on average though and there're not many multiple different values.

          I personally think that returning correct data is more important than performance in this case, but that may just be because my particular setup doesn't suffer any significant loss of performance. I didn't see any other way to correct the behaviour of the current trunk code, but if anybody else has a better idea how to do it, please let us know!

          Show
          Lars Kotthoff added a comment - In the setup I've been testing it with (one large single-valued "text" field and several multi-valued fields) it didn't seem to have any serious performance implications – i.e. the randomness of my test queries was enough to mask any loss of performance The length of the multi-valued fields is in the order of 10-20 characters on average though and there're not many multiple different values. I personally think that returning correct data is more important than performance in this case, but that may just be because my particular setup doesn't suffer any significant loss of performance. I didn't see any other way to correct the behaviour of the current trunk code, but if anybody else has a better idea how to do it, please let us know!
          Hide
          Mike Klaas added a comment -

          Hey Lars,

          Yeah, I'm talking about highlighting 15kB of text in 100-200 character chunks. Maybe I can whip up a perf test for this soon.

          The reason we probably see this issue differently is that the incorrect behaviour is quite minor for most users (perhaps a bit of punctuation leaking from value to value at most). Once way to correct what you are seeing is to use a tokenizer that creates tokens out of the CJK characters, or things on boundaries. In your case, inserting a fake token when encountering a right bracket [)] would fix the problem, I think.

          Nevertheless, I think I will probably end up committing your patch after pondering it some more.

          Show
          Mike Klaas added a comment - Hey Lars, Yeah, I'm talking about highlighting 15kB of text in 100-200 character chunks. Maybe I can whip up a perf test for this soon. The reason we probably see this issue differently is that the incorrect behaviour is quite minor for most users (perhaps a bit of punctuation leaking from value to value at most). Once way to correct what you are seeing is to use a tokenizer that creates tokens out of the CJK characters, or things on boundaries. In your case, inserting a fake token when encountering a right bracket [)] would fix the problem, I think. Nevertheless, I think I will probably end up committing your patch after pondering it some more.
          Hide
          Lars Kotthoff added a comment -

          Hi Mike,

          In my opinion the most important reason for committing the patch is that the current implementation breaks the multi-valued field abstraction. There's no way to assert that tokenizers will always produce tokens suitable for the current implementation. It also makes for a very hard to find bug, because there're no error messages. I just found it by chance. And even if you notice that something is wrong, fixing it is non-trivial and requires quite some knowlegde how Solr does highlighting of multi-valued fields.

          So the other option is to add a page to the wiki with a workaround like you've suggested, but I think that's rather going to scare people evaluating Solr for use with CJK text away

          Show
          Lars Kotthoff added a comment - Hi Mike, In my opinion the most important reason for committing the patch is that the current implementation breaks the multi-valued field abstraction. There's no way to assert that tokenizers will always produce tokens suitable for the current implementation. It also makes for a very hard to find bug, because there're no error messages. I just found it by chance. And even if you notice that something is wrong, fixing it is non-trivial and requires quite some knowlegde how Solr does highlighting of multi-valued fields. So the other option is to add a page to the wiki with a workaround like you've suggested, but I think that's rather going to scare people evaluating Solr for use with CJK text away
          Hide
          Mike Klaas added a comment -

          committed as part of SOLR-610. thanks Lars!

          Show
          Mike Klaas added a comment - committed as part of SOLR-610 . thanks Lars!

            People

            • Assignee:
              Mike Klaas
              Reporter:
              Lars Kotthoff
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development