Lucene - Core
  1. Lucene - Core
  2. LUCENE-1824

FastVectorHighlighter truncates words at beginning and end of fragments

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: modules/highlighter
    • Labels:
      None
    • Environment:

      any

    • Lucene Fields:
      New, Patch Available

      Description

      FastVectorHighlighter does not take word boundaries into consideration when building fragments, so that in most cases the first and last word of a fragment are truncated. This makes the highlights less legible than they should be. I will attach a patch to BaseFragmentBuilder that resolves this by expanding the start and end boundaries of the fragment to the first whitespace character on either side of the fragment, or the beginning or end of the source text, whichever comes first. This significantly improves legibility, at the cost of returning a slightly larger number of characters than specified for the fragment size.

      1. LUCENE-1824.patch
        22 kB
        Koji Sekiguchi
      2. LUCENE-1824.patch
        21 kB
        Koji Sekiguchi
      3. LUCENE-1824.patch
        22 kB
        Koji Sekiguchi
      4. LUCENE-1824.patch
        17 kB
        Koji Sekiguchi
      5. LUCENE-1824.patch
        12 kB
        Koji Sekiguchi
      6. LUCENE-1824.patch
        9 kB
        Alex Vigdor

        Issue Links

          Activity

          Hide
          Michael Busch added a comment -

          Could you add a small junit that tests this (i.e. fails without the patch), Alex?

          Show
          Michael Busch added a comment - Could you add a small junit that tests this (i.e. fails without the patch), Alex?
          Hide
          Alex Vigdor added a comment - - edited

          Actually a couple of the existing tests specifically check for the faulty behavior - the attached patch for SimpleFragmentsBuilderTest tests for the non-truncating behavior implemented in the patch. For example, where the prior test looked for "ssing <b>speed</b>", it now looks for " processing <b>speed</b>". While fixing the tests I noticed an off-by-1 error in the orginal patch, which I have updated.

          Show
          Alex Vigdor added a comment - - edited Actually a couple of the existing tests specifically check for the faulty behavior - the attached patch for SimpleFragmentsBuilderTest tests for the non-truncating behavior implemented in the patch. For example, where the prior test looked for "ssing <b>speed</b>", it now looks for " processing <b>speed</b>". While fixing the tests I noticed an off-by-1 error in the orginal patch, which I have updated.
          Hide
          Michael Busch added a comment -

          ScoreOrderFragmentsBuilderTest.test3Frags() fails after applying your patches.

          Show
          Michael Busch added a comment - ScoreOrderFragmentsBuilderTest.test3Frags() fails after applying your patches.
          Hide
          Koji Sekiguchi added a comment -

          Alex,
          I don't have much time to look into this patch but I understand the requirement.
          Why I named Simple FragmentsBuilder because it simply makes fragments without concern for boundaries. I designed FragmentsBuilder can be pluggable, so that any other FragmentsBuilders can be written/contributed, e.g. WhitespaceFragmentsBuilder, SentenceAwareFragmentsBuilder, etc. I think adding new FragmentsBuilders (plus test cases) is better than modifying existing FragmentsBuilders. Don't forget that some languages (CJK) don't use period or whitespace for boundaries of words/sentences when you write new FragmentsBuilders.

          Show
          Koji Sekiguchi added a comment - Alex, I don't have much time to look into this patch but I understand the requirement. Why I named Simple FragmentsBuilder because it simply makes fragments without concern for boundaries. I designed FragmentsBuilder can be pluggable, so that any other FragmentsBuilders can be written/contributed, e.g. WhitespaceFragmentsBuilder, SentenceAwareFragmentsBuilder, etc. I think adding new FragmentsBuilders (plus test cases) is better than modifying existing FragmentsBuilders. Don't forget that some languages (CJK) don't use period or whitespace for boundaries of words/sentences when you write new FragmentsBuilders.
          Hide
          Michael Busch added a comment -

          I think we should exclude this from 2.9, as were getting very close to the code freeze.

          With the current approach tests are failing, and I agree with Koji that new functionality like this can and should be added as a new FragmentBuilder.

          Show
          Michael Busch added a comment - I think we should exclude this from 2.9, as were getting very close to the code freeze. With the current approach tests are failing, and I agree with Koji that new functionality like this can and should be added as a new FragmentBuilder.
          Hide
          Alex Vigdor added a comment -

          The failing test was due to an extra whitespace character at the beginning of the output, which I think is insignificant.

          However, I appreciate that the whitespace approach will not work for CJK, so I have moved my modifications to a new WhitespaceFragmentBuilder class and associated test class. The updated patch now contains just these two new classes and no modifications to other code.

          I don't want to hold up the release of 2.9, but anyone attempting to use the SimpleFragmentsBuilder with latin languages, or others that use whitespace to delimit words, will be dismayed by the rampant truncation!

          Show
          Alex Vigdor added a comment - The failing test was due to an extra whitespace character at the beginning of the output, which I think is insignificant. However, I appreciate that the whitespace approach will not work for CJK, so I have moved my modifications to a new WhitespaceFragmentBuilder class and associated test class. The updated patch now contains just these two new classes and no modifications to other code. I don't want to hold up the release of 2.9, but anyone attempting to use the SimpleFragmentsBuilder with latin languages, or others that use whitespace to delimit words, will be dismayed by the rampant truncation!
          Hide
          Jon Druse added a comment -

          Has this had any progress? I'm dealing with the same issues. Or is there a workaround? Thanks!

          Show
          Jon Druse added a comment - Has this had any progress? I'm dealing with the same issues. Or is there a workaround? Thanks!
          Hide
          Mark Miller added a comment -

          This has 4 votes and 5 watchers - is it ready to go in?

          Show
          Mark Miller added a comment - This has 4 votes and 5 watchers - is it ready to go in?
          Hide
          Robert Muir added a comment -

          just an idea: it seems like using a breakiterator would be the way to go here.

          Show
          Robert Muir added a comment - just an idea: it seems like using a breakiterator would be the way to go here.
          Hide
          Jahangir Anwari added a comment -

          Is there any chance of the patch being applied to the 3.x branch?

          Show
          Jahangir Anwari added a comment - Is there any chance of the patch being applied to the 3.x branch?
          Hide
          Trey Hyde added a comment -

          As of 3.2 the standard highlighter will use vectors if present (as thus gain a nice speed boost)... I'm not sure the FastVectorHighlighter has any more value. I'm "unvoting" this issue myself.

          Show
          Trey Hyde added a comment - As of 3.2 the standard highlighter will use vectors if present (as thus gain a nice speed boost)... I'm not sure the FastVectorHighlighter has any more value. I'm "unvoting" this issue myself.
          Hide
          Koji Sekiguchi added a comment -

          First draft. I introduced BoundaryScanner interface and two implementations of the interface, Simple and BreakIterator.

          SimpleBoundaryScanner uses the following default boundary chars:

          public static final Character[] DEFAULT_BOUNDARY_CHARS = {'.', ',', '!', '?', '(', '[', '{', '\t', '\n'};
          

          And they are used by SimpleBoundaryScanner to find word/sentence boundary.

          BreakIteratorBoundaryScanner can also be used to find the break of char/word/sentence/line.

          I made BaseFragmentsBuilder boundary-aware, rather than creating a new FragmentsBuilder something like BoundaryAwareFragmentsBuilder. As a result, all FragmentsBuilder is now boundary-aware natively, as long as using an appropriate BoundaryScanner.

          I've not touched test yet. Because this patch changes fragments boundaries, the existing test should go fail!

          Show
          Koji Sekiguchi added a comment - First draft. I introduced BoundaryScanner interface and two implementations of the interface, Simple and BreakIterator. SimpleBoundaryScanner uses the following default boundary chars: public static final Character [] DEFAULT_BOUNDARY_CHARS = {'.', ',', '!', '?', '(', '[', '{', '\t', '\n'}; And they are used by SimpleBoundaryScanner to find word/sentence boundary. BreakIteratorBoundaryScanner can also be used to find the break of char/word/sentence/line. I made BaseFragmentsBuilder boundary-aware, rather than creating a new FragmentsBuilder something like BoundaryAwareFragmentsBuilder. As a result, all FragmentsBuilder is now boundary-aware natively, as long as using an appropriate BoundaryScanner. I've not touched test yet. Because this patch changes fragments boundaries, the existing test should go fail!
          Hide
          Koji Sekiguchi added a comment -

          Forgot one comment. I've not taken care of Solr yet in the patch.

          Show
          Koji Sekiguchi added a comment - Forgot one comment. I've not taken care of Solr yet in the patch.
          Hide
          Koji Sekiguchi added a comment -

          I added test cases for BoundaryScanner. Still need to modify FragmentsBuilderTests so that they can pass.

          Show
          Koji Sekiguchi added a comment - I added test cases for BoundaryScanner. Still need to modify FragmentsBuilderTests so that they can pass.
          Hide
          Koji Sekiguchi added a comment -

          Uh, I forgot to add testSentenceBoundary(), testLineBoundary() etc., rather than not only word boundary test. Will add in the next patch.

          Show
          Koji Sekiguchi added a comment - Uh, I forgot to add testSentenceBoundary(), testLineBoundary() etc., rather than not only word boundary test. Will add in the next patch.
          Hide
          Koji Sekiguchi added a comment -

          Patch. I added test cases for sentence boundary and line boundary. Still needed to fix existing test.

          Show
          Koji Sekiguchi added a comment - Patch. I added test cases for sentence boundary and line boundary. Still needed to fix existing test.
          Hide
          Robert Muir added a comment -

          Thanks for adding breakiterator implementations!

          the implementation seems independent of what type of breakiterator it uses, so maybe its simpler for
          it to just be BreakIteratorBoundaryScanner(BreakIterator bi), and then the user can create the
          breakiterator however they like (they could even pass in a custom subclass, for expert control) ?

          Show
          Robert Muir added a comment - Thanks for adding breakiterator implementations! the implementation seems independent of what type of breakiterator it uses, so maybe its simpler for it to just be BreakIteratorBoundaryScanner(BreakIterator bi), and then the user can create the breakiterator however they like (they could even pass in a custom subclass, for expert control) ?
          Hide
          Koji Sekiguchi added a comment -

          Good point. The simpler version is attached.
          Still working...

          Show
          Koji Sekiguchi added a comment - Good point. The simpler version is attached. Still working...
          Hide
          Koji Sekiguchi added a comment -

          New patch. Now all tests pass. Will commit soon.

          For Solr, I'll open a separate issue.

          Show
          Koji Sekiguchi added a comment - New patch. Now all tests pass. Will commit soon. For Solr, I'll open a separate issue.
          Hide
          Koji Sekiguchi added a comment -

          committed trunk and 3x.

          Show
          Koji Sekiguchi added a comment - committed trunk and 3x.
          Hide
          Uwe Schindler added a comment -

          Bulk close after release of 3.5

          Show
          Uwe Schindler added a comment - Bulk close after release of 3.5

            People

            • Assignee:
              Koji Sekiguchi
              Reporter:
              Alex Vigdor
            • Votes:
              6 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development