Lucene - Core / LUCENE-6334

Fast Vector Highlighter does not properly span neighboring term offsets

    Details

    • Lucene Fields:
      New

      Description

      When term vectors are used for fast vector highlighting on a multi-valued field, a phrase that crosses two values is not highlighted properly, even though the highlighter finds the correct values to highlight.

      A good example of this is when matching source code, where you might have lines like:

      one two three five
      two three four
      five six five
      six seven eight nine eight nine eight nine eight nine eight nine eight nine
      eight nine
      ten eleven
      twelve thirteen
      

      Matching the phrase "four five" returns:

      two three four
      five six five
      six seven eight nine eight nine eight nine eight nine eight
      eight nine
      ten eleven
      

      However, it does not highlight "four" (on the first line) or "five" (on the second line), and it returns too many lines, though not all of them.
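
To make the failure mode concrete, here is a minimal sketch of how the phrase ends up straddling two values once the values share one offset space. It is plain Java with no Lucene dependency; the class and method names are hypothetical, and the one-character gap between values is an assumption for illustration (the real gap depends on the analyzer and the fragments builder).

```java
import java.util.ArrayList;
import java.util.List;

public class MultiValueOffsets {

    // Returns {start, end} (end exclusive) for each value in the concatenated
    // offset space. Assumption: exactly one separator character between values.
    public static List<int[]> valueRanges(List<String> values) {
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        for (String value : values) {
            ranges.add(new int[] {start, start + value.length()});
            start += value.length() + 1; // skip the assumed one-character gap
        }
        return ranges;
    }

    // Returns the indexes of all values that overlap [start, end).
    public static List<Integer> valuesTouched(List<int[]> ranges, int start, int end) {
        List<Integer> touched = new ArrayList<>();
        for (int i = 0; i < ranges.size(); i++) {
            int[] r = ranges.get(i);
            if (start < r[1] && end > r[0]) {
                touched.add(i);
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        List<String> values = List.of("one two three five", "two three four", "five six five");
        List<int[]> ranges = valueRanges(values);
        int phraseStart = ranges.get(1)[1] - "four".length(); // "four" ends value 1
        int phraseEnd = ranges.get(2)[0] + "five".length();   // "five" begins value 2
        System.out.println(valuesTouched(ranges, phraseStart, phraseEnd)); // prints [1, 2]
    }
}
```

A check that only accepts offsets lying entirely inside one value would reject this phrase for both values it touches, which matches the missing highlights described above.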

      The problem lies in the BaseFragmentsBuilder at line 269 because it is not checking for cross-coverage. Here is a possible solution:

      boolean started = toffs.getStartOffset() >= fieldStart;
      boolean ended = toffs.getEndOffset() <= fieldEnd;

      // existing behavior: the term offsets fall entirely within this field value
      if (started && ended) {
          toffsList.add(toffs);
          toffsIterator.remove();
      }
      // starts in this value but ends in a later one
      else if (started) {
          toffsList.add(new Toffs(toffs.getStartOffset(), fieldEnd));
          // toffsIterator.remove(); // is this necessary?
      }
      // starts in an earlier value but ends in this one
      else if (ended) {
          toffsList.add(new Toffs(fieldStart, toffs.getEndOffset()));
          // toffsIterator.remove(); // is this necessary?
      }
      else if (toffs.getEndOffset() > fieldEnd) {
          // i.e. the toffs spans the whole field
          toffsList.add(new Toffs(fieldStart, fieldEnd));
          // toffsIterator.remove(); // is this necessary?
      }
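
The four branches above amount to clipping the term offsets to the boundaries of the current value. A minimal sketch of that idea follows; it is not the committed fix, and `Span` and `clip` are hypothetical stand-ins rather than Lucene classes. One difference is worth noting: the `started` test on its own would also accept offsets that lie entirely after the value, so this sketch returns an empty result when there is no overlap at all.

```java
import java.util.Optional;

public class OffsetClipper {

    // Hypothetical stand-in for a start/end offset pair; not the Lucene Toffs class.
    public static final class Span {
        public final int start;
        public final int end; // exclusive

        public Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    // Returns the part of span inside [fieldStart, fieldEnd),
    // or an empty Optional when the two ranges do not overlap.
    public static Optional<Span> clip(Span span, int fieldStart, int fieldEnd) {
        int start = Math.max(span.start, fieldStart);
        int end = Math.min(span.end, fieldEnd);
        return start < end ? Optional.of(new Span(start, end)) : Optional.empty();
    }

    public static void main(String[] args) {
        Span phrase = new Span(29, 38); // a phrase crossing two values
        clip(phrase, 0, 18).ifPresent(System.out::println);  // no overlap, prints nothing
        clip(phrase, 19, 33).ifPresent(System.out::println); // prints [29,33)
        clip(phrase, 34, 47).ifPresent(System.out::println); // prints [34,38)
    }
}
```

Using max/min covers the "fully inside", "starts here", "ends here", and "spans the whole value" cases with one expression, at the cost of always allocating a new span.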
      

        Activity

        Michael McCandless added a comment -

        Thanks Chris Earle, could you boil this into a small test case showing the issue? You can model it after one of the existing tests... and then use "svn diff" to make a patch with that test and the proposed fix? Thanks!

        Michael McCandless added a comment -

        Vijay Kamabathula yes please!

        Nik Everett added a comment -

        Would anyone object to me having a look at this?

        Nik Everett added a comment -

        Test case and fix based on examples and source code provided in problem description. I started with the proposed fix and modified it quite a bit to get something that should get the job done. Also expanded on the proposed test cases to include things like phrases that span entire values.

        Michael McCandless added a comment -

        Thanks Nik Everett, patch looks good, I'll commit shortly...

        ASF subversion and git services added a comment -

        Commit 1693155 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1693155 ]

        LUCENE-6334: fix FastVectorHighlighter when a phrase spans more than one value in a multi-valued field

        ASF subversion and git services added a comment -

        Commit 1693156 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1693156 ]

        LUCENE-6334: fix FastVectorHighlighter when a phrase spans more than one value in a multi-valued field

        Michael McCandless added a comment -

        Thanks Chris Earle and Nik Everett!

        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

          People

          • Assignee:
            Unassigned
            Reporter:
            Chris Earle
          • Votes:
            0
            Watchers:
            6
