Lucene - Core
  1. Lucene - Core
  2. LUCENE-1822

FastVectorHighlighter: SimpleFragListBuilder hard-coded 6 char margin is too naive

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/highlighter
    • Labels:
      None
    • Environment:

      any

    • Lucene Fields:
      New, Patch Available

      Description

      The new FastVectorHighlighter performs extremely well, however I've found in testing that the window of text chosen per fragment is often very poor, as it is hard coded in SimpleFragListBuilder to always select starting 6 characters to the left of the first phrase match in a fragment. When selecting long fragments, this often means that there is barely any context before the highlighted word, and lots after; even worse, when highlighting a phrase at the end of a short text the beginning is cut off, even though the entire phrase would fit in the specified fragCharSize. For example, highlighting "Punishment" in "Crime and Punishment" returns "e and <b>Punishment</b>" no matter what fragCharSize is specified. I am going to attach a patch that improves the text window selection by recalculating the starting margin once all phrases in the fragment have been identified - this way if a single word is matched in a fragment, it will appear in the middle of the highlight, instead of 6 characters from the beginning. This way one can also guarantee that the entirety of short texts are represented in a fragment by specifying a large enough fragCharSize.

      1. LUCENE-1822.patch
        9 kB
        Koji Sekiguchi
      2. LUCENE-1822.patch
        2 kB
        Koji Sekiguchi
      3. LUCENE-1822.patch
        2 kB
        Alex Vigdor
      4. LUCENE-1822-tests.patch
        7 kB
        Arcadius Ahouansou

        Issue Links

          Activity

          Hide
          Chas Emerick added a comment -

          Thank you for the patch. I agree, the context surrounding each fragment could definitely be improved.

          Show
          Chas Emerick added a comment - Thank you for the patch. I agree, the context surrounding each fragment could definitely be improved.
          Hide
          Arcadius Ahouansou added a comment - - edited

          We are hitting this issue with Solr 3.5 (i.e. our result title is being truncated when the search term is close to the end of the title).

          It is quite critical for us.

          This is forcing many to implement their own highlighting.

          Thanks.

          Show
          Arcadius Ahouansou added a comment - - edited We are hitting this issue with Solr 3.5 (i.e. our result title is being truncated when the search term is close to the end of the title). It is quite critical for us. This is forcing many to implement their own highlighting. Thanks.
          Hide
          Koji Sekiguchi added a comment -

          Updated the patch for current trunk.

          Show
          Koji Sekiguchi added a comment - Updated the patch for current trunk.
          Hide
          Koji Sekiguchi added a comment -

          I got errors when I do ant test at lucene/highlighter/ .

          Can someone take this ticket? Sorry, but I don't have time to look into it for a while.

          Show
          Koji Sekiguchi added a comment - I got errors when I do ant test at lucene/highlighter/ . Can someone take this ticket? Sorry, but I don't have time to look into it for a while.
          Hide
          Arcadius Ahouansou added a comment -

          Hi Koji.

          Thanks for the patch update.
          The failing tests have been fixed.
          Some are obvious.
          For the tests checking for subInfo,
          we have something like
          exptected: subInfos=(theboth((195,203)))/0.86791086(189,289)
          actual : subInfos=(theboth((195,203)))/0.86791086(149,249)

          Honestly, I haven't got into the detail of verifying/counting the offset positions for the search terms.

          Could you have a look please?

          Thanks.

          Show
          Arcadius Ahouansou added a comment - Hi Koji. Thanks for the patch update. The failing tests have been fixed. Some are obvious. For the tests checking for subInfo, we have something like exptected: subInfos=(theboth((195,203)))/0.86791086(189,289) actual : subInfos=(theboth((195,203)))/0.86791086(149,249) Honestly, I haven't got into the detail of verifying/counting the offset positions for the search terms. Could you have a look please? Thanks.
          Hide
          Koji Sekiguchi added a comment -

          Thank you for updating tests, Arcadius!

          I'll commit tomorrow if no one objects.

          Show
          Koji Sekiguchi added a comment - Thank you for updating tests, Arcadius! I'll commit tomorrow if no one objects.
          Hide
          Arcadius Ahouansou added a comment -

          Hi Koji.

          In the file
          lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/SimpleFragmentsBuilderTest.java

          I have added a TODO regarding empty spaces being added.

          Is this something we should deal with?

          Thanks.

          Show
          Arcadius Ahouansou added a comment - Hi Koji. In the file lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/SimpleFragmentsBuilderTest.java I have added a TODO regarding empty spaces being added. Is this something we should deal with? Thanks.
          Hide
          Koji Sekiguchi added a comment -

          Is this something we should deal with?

          I don't think so. The cause is because there are empty values in multi valued field in the indexed test data:

          protected static final String[] shortMVValues = {
            "",
            "",
            "a b c",
            "",   // empty data in multi valued field
            "d e"
          };
          

          and these spaces used to be not trimmed before applying the patch. We can open another ticket for trimming spaces if needed. Thanks for notifying me it anyway, Arcadius.

          I'll commit shortly.

          Show
          Koji Sekiguchi added a comment - Is this something we should deal with? I don't think so. The cause is because there are empty values in multi valued field in the indexed test data: protected static final String [] shortMVValues = { "", "", "a b c" , "", // empty data in multi valued field "d e" }; and these spaces used to be not trimmed before applying the patch. We can open another ticket for trimming spaces if needed. Thanks for notifying me it anyway, Arcadius. I'll commit shortly.
          Hide
          Koji Sekiguchi added a comment -

          trunk: Committed revision 1395847.
          branch_4x: Committed revision 1395848.

          Show
          Koji Sekiguchi added a comment - trunk: Committed revision 1395847. branch_4x: Committed revision 1395848.
          Hide
          Simon Willnauer added a comment -

          Hey Koji,

          I just tracked down a changed behavior to this issue. I think this is a major change in Runtime Behavior / BW Compatibility but I only see this listed as a bugfix with almost no info attached in CHANGES.txt. I think we should really document this change here in the CHANGES.TXT file since a lot of users might be affected. Don't get me wrong I think this change is a very good change and makes the behavior more intuitive but I really spend a long time to figure out why my tests failed.

          Show
          Simon Willnauer added a comment - Hey Koji, I just tracked down a changed behavior to this issue. I think this is a major change in Runtime Behavior / BW Compatibility but I only see this listed as a bugfix with almost no info attached in CHANGES.txt. I think we should really document this change here in the CHANGES.TXT file since a lot of users might be affected. Don't get me wrong I think this change is a very good change and makes the behavior more intuitive but I really spend a long time to figure out why my tests failed.
          Hide
          Koji Sekiguchi added a comment -

          Uh, Simon, sorry for my lack of prudence.

          Show
          Koji Sekiguchi added a comment - Uh, Simon, sorry for my lack of prudence.
          Hide
          Simon Willnauer added a comment -

          Koji, not a big deal. I just wonder if we should add more infos to CHANGES.TXT so others will not run into the same problems.

          Show
          Simon Willnauer added a comment - Koji, not a big deal. I just wonder if we should add more infos to CHANGES.TXT so others will not run into the same problems.
          Hide
          Koji Sekiguchi added a comment -

          Sure, that's great. Do you have a draft note?

          Show
          Koji Sekiguchi added a comment - Sure, that's great. Do you have a draft note?
          Hide
          Simon Willnauer added a comment -

          Sure, that's great. Do you have a draft note?

          not really I still need to figure out what this change here exactly added. Seems like it made the fragments "center" the highlight?

          Show
          Simon Willnauer added a comment - Sure, that's great. Do you have a draft note? not really I still need to figure out what this change here exactly added. Seems like it made the fragments "center" the highlight?
          Hide
          Koji Sekiguchi added a comment -

          Here is the patch for trunk. I think the original reporter Alex describes the heart of the problem very well, so I borrow the description.

          $ svn diff
          Index: lucene/CHANGES.txt
          ===================================================================
          --- lucene/CHANGES.txt	(revision 1437783)
          +++ lucene/CHANGES.txt	(working copy)
          @@ -414,6 +414,13 @@
             This only affects requests with depth>1. If you execute such requests and
             rely on the facet results being returned flat (i.e. no hierarchy), you should
             set the ResultMode to GLOBAL_FLAT. (Shai Erera, Gilad Barkai) 
          +
          +* LUCENE-1822: Improves the text window selection by recalculating the starting margin
          +  once all phrases in the fragment have been identified in FastVectorHighlighter. This
          +  way if a single word is matched in a fragment, it will appear in the middle of the highlight,
          +  instead of 6 characters from the beginning. This way one can also guarantee that
          +  the entirety of short texts are represented in a fragment by specifying a large
          +  enough fragCharSize.
             
           Optimizations
           
          
          Show
          Koji Sekiguchi added a comment - Here is the patch for trunk. I think the original reporter Alex describes the heart of the problem very well, so I borrow the description. $ svn diff Index: lucene/CHANGES.txt =================================================================== --- lucene/CHANGES.txt (revision 1437783) +++ lucene/CHANGES.txt (working copy) @@ -414,6 +414,13 @@ This only affects requests with depth>1. If you execute such requests and rely on the facet results being returned flat (i.e. no hierarchy), you should set the ResultMode to GLOBAL_FLAT. (Shai Erera, Gilad Barkai) + +* LUCENE-1822: Improves the text window selection by recalculating the starting margin + once all phrases in the fragment have been identified in FastVectorHighlighter. This + way if a single word is matched in a fragment, it will appear in the middle of the highlight, + instead of 6 characters from the beginning. This way one can also guarantee that + the entirety of short texts are represented in a fragment by specifying a large + enough fragCharSize. Optimizations
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Koji Sekiguchi
          http://svn.apache.org/viewvc?view=revision&revision=1438822

          LUCENE-1822: add a note in Changes in runtime behavior

          Show
          Commit Tag Bot added a comment - [trunk commit] Koji Sekiguchi http://svn.apache.org/viewvc?view=revision&revision=1438822 LUCENE-1822 : add a note in Changes in runtime behavior
          Hide
          Koji Sekiguchi added a comment -

          I committed the above note to trunk, branch_4x and lucene_solr_4_1.

          Show
          Koji Sekiguchi added a comment - I committed the above note to trunk, branch_4x and lucene_solr_4_1.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Koji Sekiguchi
          http://svn.apache.org/viewvc?view=revision&revision=1438824

          LUCENE-1822: add a note in Changes in runtime behavior

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Koji Sekiguchi http://svn.apache.org/viewvc?view=revision&revision=1438824 LUCENE-1822 : add a note in Changes in runtime behavior
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Koji Sekiguchi
          http://svn.apache.org/viewvc?view=revision&revision=1395848

          LUCENE-1822: BaseFragListBuilder hard-coded 6 char margin is too naive.

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Koji Sekiguchi http://svn.apache.org/viewvc?view=revision&revision=1395848 LUCENE-1822 : BaseFragListBuilder hard-coded 6 char margin is too naive.
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Koji Sekiguchi
              Reporter:
              Alex Vigdor
            • Votes:
              4 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development