Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7620

UnifiedHighlighter: add target character width BreakIterator wrapper

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The original Highlighter includes a SimpleFragmenter that delineates fragments (aka Passages) by a character width. The default is 100 characters.

      It would be great to support something similar for the UnifiedHighlighter. It's useful in its own right and of course it helps users transition to the UH. I'd like to do it as a wrapper to another BreakIterator – perhaps a sentence one. In this way you get back Passages that are a number of sentences so they will look nice instead of breaking mid-way through a sentence. And you get some control by specifying a target number of characters. This BreakIterator wouldn't be a general purpose java.text.BreakIterator since it would assume it's called in a manner exactly as the UnifiedHighlighter uses it. It would probably be compatible with the PostingsHighlighter too.

      I don't propose doing this by default; besides, it's easy enough to pick your BreakIterator config.

        Issue Links

          Activity

          Hide
          Timothy055 Timothy M. Rodriguez added a comment -

          Me too!

          Show
          Timothy055 Timothy M. Rodriguez added a comment - Me too!
          Hide
          dsmiley David Smiley added a comment -

          Thanks for the review feedback Jim & Tim! 6.4 is going to be a great release for the UnifiedHighlighter. I hope features like this and other improvements this release get more folks using the UH.

          Show
          dsmiley David Smiley added a comment - Thanks for the review feedback Jim & Tim! 6.4 is going to be a great release for the UnifiedHighlighter. I hope features like this and other improvements this release get more folks using the UH.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit ff5fdcde422033d9cbe3dbe11f2abda9ee3a408b in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ff5fdcd ]

          LUCENE-7620: UnifiedHighlighter: new LengthGoalBreakIterator wrapper

          (cherry picked from commit ea49989)

          Show
          jira-bot ASF subversion and git services added a comment - Commit ff5fdcde422033d9cbe3dbe11f2abda9ee3a408b in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ff5fdcd ] LUCENE-7620 : UnifiedHighlighter: new LengthGoalBreakIterator wrapper (cherry picked from commit ea49989)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit ea49989524e96563f2b9bdd4256012239907882f in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ea49989 ]

          LUCENE-7620: UnifiedHighlighter: new LengthGoalBreakIterator wrapper

          Show
          jira-bot ASF subversion and git services added a comment - Commit ea49989524e96563f2b9bdd4256012239907882f in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ea49989 ] LUCENE-7620 : UnifiedHighlighter: new LengthGoalBreakIterator wrapper
          Hide
          jim.ferenczi Jim Ferenczi added a comment -

          That makes sense. The tests looks good, . I think it's ready, thanks for doing this David !

          Show
          jim.ferenczi Jim Ferenczi added a comment - That makes sense. The tests looks good, . I think it's ready, thanks for doing this David !
          Hide
          dsmiley David Smiley added a comment -

          Here's an update to the patch mostly related to testing to clarify what's being tested. And I did the createClosestToLength rename.

          Show
          dsmiley David Smiley added a comment - Here's an update to the patch mostly related to testing to clarify what's being tested. And I did the createClosestToLength rename.
          Hide
          dsmiley David Smiley added a comment - - edited

          (Tim) For the following method, does it make sense to return the baseIter if the followingIdx < startIndex? Maybe throw an exception instead or just have an assert that it's less?

          It's already asserting (line 124). Or I'm not understanding you.

          (Tim) This is subjective, but I find it's more useful to break out the different tests with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom, etc. Similar for the defaultSummary tests. This helps when coming back to the test and helps tease apart if one piece of functionality is broken vs another.

          Fair point. A better compromise in my mind that is not as verbose as your suggestion is to use the "message" parameter of the assert methods. I will do this and upload a new patch tonight.

          (Jim) ... but I wonder if the logic to get the boundary could not be simplified. Isn't it possible to always invoke baseIter.preceding(targetIdx) and based on isMinimumSize return current() or baseIter.next() ?

          No; I don't think so. If one looks at preceding(target), you don't know if it's result is closer to the target than the following break or not. The "target" mode of this BI gets the closest break. Come to think of it, maybe I should rename createTargetLength to be createClosestToLength. At least it's javadocs are already clear I think?

          Show
          dsmiley David Smiley added a comment - - edited (Tim) For the following method, does it make sense to return the baseIter if the followingIdx < startIndex? Maybe throw an exception instead or just have an assert that it's less? It's already asserting (line 124). Or I'm not understanding you. (Tim) This is subjective, but I find it's more useful to break out the different tests with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom, etc. Similar for the defaultSummary tests. This helps when coming back to the test and helps tease apart if one piece of functionality is broken vs another. Fair point. A better compromise in my mind that is not as verbose as your suggestion is to use the "message" parameter of the assert methods. I will do this and upload a new patch tonight. (Jim) ... but I wonder if the logic to get the boundary could not be simplified. Isn't it possible to always invoke baseIter.preceding(targetIdx) and based on isMinimumSize return current() or baseIter.next() ? No; I don't think so. If one looks at preceding(target) , you don't know if it's result is closer to the target than the following break or not. The "target" mode of this BI gets the closest break. Come to think of it, maybe I should rename createTargetLength to be createClosestToLength . At least it's javadocs are already clear I think?
          Hide
          jim.ferenczi Jim Ferenczi added a comment -

          I think the two methods to create the break iterator are useful but I wonder if the logic to get the boundary could not be simplified.
          Isn't it possible to always invoke baseIter.preceding(targetIdx) and based on isMinimumSize return current() or baseIter.next() ?

          Show
          jim.ferenczi Jim Ferenczi added a comment - I think the two methods to create the break iterator are useful but I wonder if the logic to get the boundary could not be simplified. Isn't it possible to always invoke baseIter.preceding(targetIdx) and based on isMinimumSize return current() or baseIter.next() ?
          Hide
          Timothy055 Timothy M. Rodriguez added a comment - - edited

          Very useful! I like that it decorates an underlying BreakIterator. For the following method, does it make sense to return the baseIter if the followingIdx < startIndex? Maybe throw an exception instead or just have an assert that it's less?

          This is subjective, but I find it's more useful to break out the different tests with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom, etc. Similar for the defaultSummary tests. This helps when coming back to the test and helps tease apart if one piece of functionality is broken vs another.

          Show
          Timothy055 Timothy M. Rodriguez added a comment - - edited Very useful! I like that it decorates an underlying BreakIterator. For the following method, does it make sense to return the baseIter if the followingIdx < startIndex? Maybe throw an exception instead or just have an assert that it's less? This is subjective, but I find it's more useful to break out the different tests with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom, etc. Similar for the defaultSummary tests. This helps when coming back to the test and helps tease apart if one piece of functionality is broken vs another.
          Hide
          dsmiley David Smiley added a comment -

          Here's an updated patch. I added assertions not exceptions because if per chance this circumstance happens in production, it's really okay to return possibly the wrong break and have a passage that isn't quite the ideal size rather than throw some exception.

          It now has 2 modes of operation, with 2 corresponding factory methods to clarify which: createMinLength(...) and createTargetLength(...). The minLength mode might be useful because it's faster (than target). I think it's more useful than a MaxLength (which still could be added in the future) because a too-long passage can possibly be trimmed by the client, but the reverse is not true – you can't lengthen a passage that is too short (if it reaches the client talking to a search server).

          I did some benchmarking too; which in addition to observing the overhead also served to help ensure it didn't throw exceptions (at least for the test queries & test data). That never happened though; I squashed bugs in the test and chose sizes to tease out the edge conditions. In so doing I found a minor bug with CustomSeparatorBreakIterator but I'll leave that for another time. Benchmarking showed the minLength is noticeably faster than targetLength, maybe 10% overall. Also, (something I already knew) I observed a "cheap" underlying BreakIterator like CustomSeparatorBreakIterator is ~20% faster than a JDK Sentence one.

          I'll commit it this weekend or possibly tonight if you review it in-time positively.

          Show
          dsmiley David Smiley added a comment - Here's an updated patch. I added assertions not exceptions because if per chance this circumstance happens in production, it's really okay to return possibly the wrong break and have a passage that isn't quite the ideal size rather than throw some exception. It now has 2 modes of operation, with 2 corresponding factory methods to clarify which: createMinLength(...) and createTargetLength(...) . The minLength mode might be useful because it's faster (than target). I think it's more useful than a MaxLength (which still could be added in the future) because a too-long passage can possibly be trimmed by the client, but the reverse is not true – you can't lengthen a passage that is too short (if it reaches the client talking to a search server). I did some benchmarking too; which in addition to observing the overhead also served to help ensure it didn't throw exceptions (at least for the test queries & test data). That never happened though; I squashed bugs in the test and chose sizes to tease out the edge conditions. In so doing I found a minor bug with CustomSeparatorBreakIterator but I'll leave that for another time. Benchmarking showed the minLength is noticeably faster than targetLength, maybe 10% overall. Also, (something I already knew) I observed a "cheap" underlying BreakIterator like CustomSeparatorBreakIterator is ~20% faster than a JDK Sentence one. I'll commit it this weekend or possibly tonight if you review it in-time positively.
          Hide
          jim.ferenczi Jim Ferenczi added a comment -

          By choosing a lengthGoal on the low side; maybe "too long" will tend not to be a problem? Or see my TODO at the top of the file – essentially choose the break that is closest to the goal instead of always the first following it.

          Yeah depends how the lengthGoal is perceived. I was looking at it as a boundary mainly to solve "too long" fragment. And this issue is more about "too short" fragments. Maybe a different issue then but I am just afraid that we'll end up with multiple public break iterator impls that must follow a specific pattern to be used.
          Anyway this patch is a start to get better highlighting through custom break iterator and it solves a real issue. Please push to 6.4 if you think it's ready, we can always discuss the next steps in a follow up.
          Regarding the assertion I prefer an IllegalStateException with a clear message but I am maybe too paranoid.

          Show
          jim.ferenczi Jim Ferenczi added a comment - By choosing a lengthGoal on the low side; maybe "too long" will tend not to be a problem? Or see my TODO at the top of the file – essentially choose the break that is closest to the goal instead of always the first following it. Yeah depends how the lengthGoal is perceived. I was looking at it as a boundary mainly to solve "too long" fragment. And this issue is more about "too short" fragments. Maybe a different issue then but I am just afraid that we'll end up with multiple public break iterator impls that must follow a specific pattern to be used. Anyway this patch is a start to get better highlighting through custom break iterator and it solves a real issue. Please push to 6.4 if you think it's ready, we can always discuss the next steps in a follow up. Regarding the assertion I prefer an IllegalStateException with a clear message but I am maybe too paranoid.
          Hide
          dsmiley David Smiley added a comment -

          Though I wonder if we should also break the sentence if it's too long ? Maybe the wrapped breakiterator could always be a sentence one and we could use a WordBreakIterator to cut sentences that are too long ? This way it would produce snippets that are similar to the SimpleFragmenter.

          It could also be done in another breakiterator on top of this one but this would make things over complicated, I guess.

          By choosing a lengthGoal on the low side; maybe "too long" will tend not to be a problem? Or see my TODO at the top of the file – essentially choose the break that is closest to the goal instead of always the first following it. Maybe I'll add that in my next patch.

          I don't think we should try to emulate SimpleFragmenter exactly. We can do a much better job I like this implementation as a wrapper BreakIterator.... perhaps we'll add a Regex BI one day and then it would simply fit right in.

          For the implementation can you throw an exception on the method that should not be called ? For instance ...(etc)

          Yeah I could go either way on that... how about assert false : "not supported/expected";?

          Additionally I think that we should have a way to change the start and end of a passage when we know all the match that it contains. This is what the FVH is doing and it should be doable in the UH because the passage are created on the fly in forward manner. This is of course not the purpose of this issue and it should be treated as a new feature but I think it would be great to have the same output than the FVH when the max length of the passage is set.

          Definitely a separate issue. It wouldn't fit into the BreakIterator abstraction either. Maybe some Passage post-processor like thing. Or maybe simply expose sufficient hooks to allow subclassers to do this. That keeps the UH simpler.

          Show
          dsmiley David Smiley added a comment - Though I wonder if we should also break the sentence if it's too long ? Maybe the wrapped breakiterator could always be a sentence one and we could use a WordBreakIterator to cut sentences that are too long ? This way it would produce snippets that are similar to the SimpleFragmenter. It could also be done in another breakiterator on top of this one but this would make things over complicated, I guess. By choosing a lengthGoal on the low side; maybe "too long" will tend not to be a problem? Or see my TODO at the top of the file – essentially choose the break that is closest to the goal instead of always the first following it. Maybe I'll add that in my next patch. I don't think we should try to emulate SimpleFragmenter exactly. We can do a much better job I like this implementation as a wrapper BreakIterator.... perhaps we'll add a Regex BI one day and then it would simply fit right in. For the implementation can you throw an exception on the method that should not be called ? For instance ...(etc) Yeah I could go either way on that... how about assert false : "not supported/expected"; ? Additionally I think that we should have a way to change the start and end of a passage when we know all the match that it contains. This is what the FVH is doing and it should be doable in the UH because the passage are created on the fly in forward manner. This is of course not the purpose of this issue and it should be treated as a new feature but I think it would be great to have the same output than the FVH when the max length of the passage is set. Definitely a separate issue. It wouldn't fit into the BreakIterator abstraction either. Maybe some Passage post-processor like thing. Or maybe simply expose sufficient hooks to allow subclassers to do this. That keeps the UH simpler.
          Hide
          jim.ferenczi Jim Ferenczi added a comment -

          It looks good David Smiley ! I've started to work on something similar but got caught into something else
          Though I wonder if we should also break the sentence if it's too long ? Maybe the wrapped breakiterator could always be a sentence one and we could use a WordBreakIterator to cut sentences that are too long ? This way it would produce snippets that are similar to the SimpleFragmenter.
          It could also be done in another breakiterator on top of this one but this would make things over complicated, I guess.
          For the implementation can you throw an exception on the method that should not be called ? For instance

          next(n)

          cannot be implemented efficiently (you need to start from 0 if you want to know the Nth boundary) but currently it returns the Nth boundary of the wrapped break iterator. I think it's better to throw an exception, this way it is obvious that some methods should not be called.

          Additionally I think that we should have a way to change the start and end of a passage when we know all the match that it contains. This is what the FVH is doing and it should be doable in the UH because the passage are created on the fly in forward manner. This is of course not the purpose of this issue and it should be treated as a new feature but I think it would be great to have the same output than the FVH when the max length of the passage is set.

          Show
          jim.ferenczi Jim Ferenczi added a comment - It looks good David Smiley ! I've started to work on something similar but got caught into something else Though I wonder if we should also break the sentence if it's too long ? Maybe the wrapped breakiterator could always be a sentence one and we could use a WordBreakIterator to cut sentences that are too long ? This way it would produce snippets that are similar to the SimpleFragmenter. It could also be done in another breakiterator on top of this one but this would make things over complicated, I guess. For the implementation can you throw an exception on the method that should not be called ? For instance next(n) cannot be implemented efficiently (you need to start from 0 if you want to know the Nth boundary) but currently it returns the Nth boundary of the wrapped break iterator. I think it's better to throw an exception, this way it is obvious that some methods should not be called. Additionally I think that we should have a way to change the start and end of a passage when we know all the match that it contains. This is what the FVH is doing and it should be doable in the UH because the passage are created on the fly in forward manner. This is of course not the purpose of this issue and it should be treated as a new feature but I think it would be great to have the same output than the FVH when the max length of the passage is set.
          Hide
          dsmiley David Smiley added a comment -

          Here's a patch. I'm calling it LengthGoalBreakIterator. In time, perhaps we might add some tweaks like a "slop" akin to the LuceneRegexFragmenter (in Solr).

          Jim Ferenczi I thought you might want to take a peek. I figure this can get into 6.4; I'll commit it this weekend.

          Show
          dsmiley David Smiley added a comment - Here's a patch. I'm calling it LengthGoalBreakIterator . In time, perhaps we might add some tweaks like a "slop" akin to the LuceneRegexFragmenter (in Solr). Jim Ferenczi I thought you might want to take a peek. I figure this can get into 6.4; I'll commit it this weekend.

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              dsmiley David Smiley
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development