Lucene - Core
  1. Lucene - Core
  2. LUCENE-4734

FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      If a proximity phrase query overlaps with any other query term it will not be highlighted.

      Example Text: A B C D E F G

      Example Queries:

      "B E"~10 D
      (D will be highlighted instead of "B C D E")

      "B E"~10 "C F"~10
      (nothing will be highlighted)

      This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be "B". The second TermInfo will be "D" which will not be found in the submap for "B E"~10 and will trigger a failed match.

      1. LUCENE-4734-2.patch
        6 kB
        Adrien Grand
      2. LUCENE-4734.patch
        22 kB
        Adrien Grand
      3. lucene-4734.patch
        9 kB
        Ryan Lauck

        Issue Links

          Activity

          Hide
          Ryan Lauck added a comment -

          Added a patch with two test cases that reproduce the issue

          Show
          Ryan Lauck added a comment - Added a patch with two test cases that reproduce the issue
          Hide
          Ryan Lauck added a comment -

          Tricky problem! I created a patch and modified my test cases (deleted the old test case patch).

          I'd appreciate any feedback, my solution seems durable and passes all highlighter test cases but I took a slightly different approach to finding the longest matching phrase.

          Also a bonus idea!
          The current addIfNoOverlap method assumes that we would never want overlapping highlights and throws them out. A better approach might be to allow the user to provide a delegate that can add/modify overlapping WeightedPhraseInfo, some possible implementations could be:

          • first only - current behavior
          • merge - creates a single WPI that covers all overlaps
          • split - gives priority to the first/existing WPI and add whatever is left overlaps
          • longest - gives priority to the longest WPI and drop any overlaps
          • include all - this use case solves my need to return only the offsets of all highlights in a document and perform the highlighting and overlap handling myself at a later stage.
          Show
          Ryan Lauck added a comment - Tricky problem! I created a patch and modified my test cases (deleted the old test case patch). I'd appreciate any feedback, my solution seems durable and passes all highlighter test cases but I took a slightly different approach to finding the longest matching phrase. Also a bonus idea! The current addIfNoOverlap method assumes that we would never want overlapping highlights and throws them out. A better approach might be to allow the user to provide a delegate that can add/modify overlapping WeightedPhraseInfo, some possible implementations could be: first only - current behavior merge - creates a single WPI that covers all overlaps split - gives priority to the first/existing WPI and add whatever is left overlaps longest - gives priority to the longest WPI and drop any overlaps include all - this use case solves my need to return only the offsets of all highlights in a document and perform the highlighting and overlap handling myself at a later stage.
          Hide
          Ryan Lauck added a comment -

          I hope I'm not stepping on any toes here, but I realized my patch is similar to some of the work done in LUCENE-4118. My patch also solves the bug where repeated terms in a proximity query cause highlight matching to fail.

          I also took a different approach to handling reverse order matching on slop queries so that this patch could be a complete alternative to LUCENE-4118. I modified QueryPhraseMap.add to detect PhraseQuerys with slop and create a second mapping for the phrase terms in reverse order - this way no other code needs to change to handle proximity phrase terms appearing in reverse order.

          I added two simple test cases for both reverse ordering and repeated terms.

          Show
          Ryan Lauck added a comment - I hope I'm not stepping on any toes here, but I realized my patch is similar to some of the work done in LUCENE-4118 . My patch also solves the bug where repeated terms in a proximity query cause highlight matching to fail. I also took a different approach to handling reverse order matching on slop queries so that this patch could be a complete alternative to LUCENE-4118 . I modified QueryPhraseMap.add to detect PhraseQuerys with slop and create a second mapping for the phrase terms in reverse order - this way no other code needs to change to handle proximity phrase terms appearing in reverse order. I added two simple test cases for both reverse ordering and repeated terms.
          Hide
          Ryan Lauck added a comment -

          Store the max possible slop on the QueryPhraseMap rather than the entire FieldQuery. This limits unnecessary matching when a PhraseQuery with a large slop is combined with other PhraseQuerys.

          Also, I added a fragment of slop recalculation code from WeightedSpanTermExtractor that handles PhraseQuerys with position gaps. The most common way this is encountered is by searching a phrase that contains stop words while using an analyzer that filters them.

          Also cleaned up the test cases a little, and added a few comments.

          Show
          Ryan Lauck added a comment - Store the max possible slop on the QueryPhraseMap rather than the entire FieldQuery. This limits unnecessary matching when a PhraseQuery with a large slop is combined with other PhraseQuerys. Also, I added a fragment of slop recalculation code from WeightedSpanTermExtractor that handles PhraseQuerys with position gaps. The most common way this is encountered is by searching a phrase that contains stop words while using an analyzer that filters them. Also cleaned up the test cases a little, and added a few comments.
          Hide
          Adrien Grand added a comment -

          Ryan, I iterated over your patch in order to be able to handle a few more queries, specifically phrase queries that contain gaps or have several terms at the same position.

          It is very hard to handle all possibilities without making the highlighting complexity explode. I'm looking forward to LUCENE-2878 so that highlighting can be more efficient and doesn't need to duplicate the query interpretation logic anymore.

          Show
          Adrien Grand added a comment - Ryan, I iterated over your patch in order to be able to handle a few more queries, specifically phrase queries that contain gaps or have several terms at the same position. It is very hard to handle all possibilities without making the highlighting complexity explode. I'm looking forward to LUCENE-2878 so that highlighting can be more efficient and doesn't need to duplicate the query interpretation logic anymore.
          Hide
          Ryan Lauck added a comment -

          Thanks Adrien!

          I agree about LUCENE-2878. I came to the same conclusion before finding that someone had already done most of the work that the ideal scenario is to (optionally) pull postings or term vectors in addition to payloads while scoring and expose them for highlighting. I'm looking forward to that patch too!

          An idea I began working on but haven't polished enough to submit a patch for:

          Users of the API could access raw highlight metadata (offsets and positions) and could additionally process to merge/filter/ignore overlapping highlights - one flaw I've had to work around in existing highlighters is that when highlights overlap they either merge them or toss all but the first encountered. We perform the highlighting manually in our system and hope to one day allow end users to toggle which terms are highlighted without having to make round-trips to the server to modify the search criteria and rerun the highlighter. With raw offset data this is trivial and merging/discarding overlaps can be handled in client-side code. There are additional advantages too such as being able to highlight find-in-page or search-within-search results and only having to transfer new offset metadata rather than the entire text over the wire (we have some very big 100MB+ documents).

          Show
          Ryan Lauck added a comment - Thanks Adrien! I agree about LUCENE-2878 . I came to the same conclusion before finding that someone had already done most of the work that the ideal scenario is to (optionally) pull postings or term vectors in addition to payloads while scoring and expose them for highlighting. I'm looking forward to that patch too! An idea I began working on but haven't polished enough to submit a patch for: Users of the API could access raw highlight metadata (offsets and positions) and could additionally process to merge/filter/ignore overlapping highlights - one flaw I've had to work around in existing highlighters is that when highlights overlap they either merge them or toss all but the first encountered. We perform the highlighting manually in our system and hope to one day allow end users to toggle which terms are highlighted without having to make round-trips to the server to modify the search criteria and rerun the highlighter. With raw offset data this is trivial and merging/discarding overlaps can be handled in client-side code. There are additional advantages too such as being able to highlight find-in-page or search-within-search results and only having to transfer new offset metadata rather than the entire text over the wire (we have some very big 100MB+ documents).
          Hide
          Adrien Grand added a comment -

          Hey Ryan, I think the use-case you are describing will be possible. However this will require some care because offsets computed by Lucene's analysis API are offsets for UTF16-encoded content (Java's internal encoding). So if your client code' programming language has a different internal encoding, you will need to perform conversions (this is not a fundamental problem, just something to be aware of in order not to get bad surprises).

          Show
          Adrien Grand added a comment - Hey Ryan, I think the use-case you are describing will be possible. However this will require some care because offsets computed by Lucene's analysis API are offsets for UTF16-encoded content (Java's internal encoding). So if your client code' programming language has a different internal encoding, you will need to perform conversions (this is not a fundamental problem, just something to be aware of in order not to get bad surprises).
          Hide
          ASF subversion and git services added a comment -

          Commit 1504862 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1504862 ]

          LUCENE-4734: Add FastVectorHighlighter support for proximity queries and phrase queries with gaps or overlapping terms.

          Show
          ASF subversion and git services added a comment - Commit 1504862 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1504862 ] LUCENE-4734 : Add FastVectorHighlighter support for proximity queries and phrase queries with gaps or overlapping terms.
          Hide
          ASF subversion and git services added a comment -

          Commit 1504863 from Adrien Grand in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1504863 ]

          LUCENE-4734: Add FastVectorHighlighter support for proximity queries and phrase queries with gaps or overlapping terms.

          Show
          ASF subversion and git services added a comment - Commit 1504863 from Adrien Grand in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1504863 ] LUCENE-4734 : Add FastVectorHighlighter support for proximity queries and phrase queries with gaps or overlapping terms.
          Hide
          Adrien Grand added a comment -

          The approach I used can be memory-intensive when there are many positions that have several terms, here is a fix.

          Show
          Adrien Grand added a comment - The approach I used can be memory-intensive when there are many positions that have several terms, here is a fix.
          Hide
          ASF subversion and git services added a comment -

          Commit 1505731 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1505731 ]

          LUCENE-4734: Better memory efficiency.

          Show
          ASF subversion and git services added a comment - Commit 1505731 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1505731 ] LUCENE-4734 : Better memory efficiency.
          Hide
          ASF subversion and git services added a comment -

          Commit 1505732 from Adrien Grand in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1505732 ]

          LUCENE-4734: Better memory efficiency.

          Show
          ASF subversion and git services added a comment - Commit 1505732 from Adrien Grand in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1505732 ] LUCENE-4734 : Better memory efficiency.
          Hide
          Ryan Lauck added a comment -

          I'm curious what you think about the comment I had in my original patch before calling addIfNoOverlap. It seems wasteful to traverse the phrase candidate list from the beginning every iteration to search for overlaps, and also prevents gathering the raw highlights as mentioned in my previous comment. What do you think about waiting until all phrase candidates are gathered to optionally filter overlaps?

          Show
          Ryan Lauck added a comment - I'm curious what you think about the comment I had in my original patch before calling addIfNoOverlap. It seems wasteful to traverse the phrase candidate list from the beginning every iteration to search for overlaps, and also prevents gathering the raw highlights as mentioned in my previous comment. What do you think about waiting until all phrase candidates are gathered to optionally filter overlaps?
          Hide
          Adrien Grand added a comment -

          I agree this seems wasteful. Maybe we could open an issue about it?

          Show
          Adrien Grand added a comment - I agree this seems wasteful. Maybe we could open an issue about it?
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Jack Krupansky added a comment -

          Looking at some highlighter code, I see this constructor in org.apache.lucene.search.vectorhighlight.FieldPhraseList.java of branch_4x:

          /**
           * a constructor.
           * 
           * @param fieldTermStack FieldTermStack object
           * @param fieldQuery FieldQuery object
           * @param phraseLimit maximum size of phraseList
           */
          public FieldPhraseList( FieldTermStack fieldTermStack, FieldQuery fieldQuery, int phraseLimit ){
            final String field = fieldTermStack.getFieldName();
          
            QueryPhraseMap qpm = fieldQuery.getRootMap(field);
            if (qpm != null) {
              LinkedList<TermInfo> phraseCandidate = new LinkedList<TermInfo>();
              extractPhrases(fieldTermStack.termList, qpm, phraseCandidate, 0);
              assert phraseCandidate.size() == 0;
            }
          }
          

          Clearly phraseLimit is no longer used. Is it being deprecated, or is this simply work in progress that will use it again eventually?

          This parameter is passed over several layers of code, ultimately it is set up in Solr using the hl.phraseLimit parameter.

          Seems like a "dead parameter" that should be cleaned up now or deprecated for future cleanup, but I can't say that I have been able to follow all of the work that has transpired in the highlighters.

          The change occurred in Revision 1505732 (related to this Jira.) Before then, this parameter was used.

          Comments? Or should this be a separate Jira issue?

          Show
          Jack Krupansky added a comment - Looking at some highlighter code, I see this constructor in org.apache.lucene.search.vectorhighlight.FieldPhraseList.java of branch_4x: /** * a constructor. * * @param fieldTermStack FieldTermStack object * @param fieldQuery FieldQuery object * @param phraseLimit maximum size of phraseList */ public FieldPhraseList( FieldTermStack fieldTermStack, FieldQuery fieldQuery, int phraseLimit ){ final String field = fieldTermStack.getFieldName(); QueryPhraseMap qpm = fieldQuery.getRootMap(field); if (qpm != null ) { LinkedList<TermInfo> phraseCandidate = new LinkedList<TermInfo>(); extractPhrases(fieldTermStack.termList, qpm, phraseCandidate, 0); assert phraseCandidate.size() == 0; } } Clearly phraseLimit is no longer used. Is it being deprecated, or is this simply work in progress that will use it again eventually? This parameter is passed over several layers of code, ultimately it is set up in Solr using the hl.phraseLimit parameter. Seems like a "dead parameter" that should be cleaned up now or deprecated for future cleanup, but I can't say that I have been able to follow all of the work that has transpired in the highlighters. The change occurred in Revision 1505732 (related to this Jira.) Before then, this parameter was used. Comments? Or should this be a separate Jira issue?
          Hide
          Adrien Grand added a comment -

          I've encountered several issues recently because of this patch I committed and I am considering reverting it. The issue is that it makes the runtime of FastVectorHighlighter very bad and this just doesn't work when the text to highlight has lots of occurrences of the query terms.

          I spent time trying to improve it but I couldn't find a way to make it work nicely. I wouldn't feel good to have a release of Lucene out with this patch in it so I will revert the patch tomorrow if there is no objection.

          Show
          Adrien Grand added a comment - I've encountered several issues recently because of this patch I committed and I am considering reverting it. The issue is that it makes the runtime of FastVectorHighlighter very bad and this just doesn't work when the text to highlight has lots of occurrences of the query terms. I spent time trying to improve it but I couldn't find a way to make it work nicely. I wouldn't feel good to have a release of Lucene out with this patch in it so I will revert the patch tomorrow if there is no objection.
          Hide
          Ryan Lauck added a comment -

          I did some performance testing of my original patch so I'd be happy to take a look at yours as well and see if I can come up with anything to help. Another idea I explored but never submitted was to build a finite state automata out of the query terms (each term is a state) and use that to highlight. Benchmarks were unstable but it seemed like a workable solution for complex queries.

          The real question is: does it make more sense to invest time in LUCENE-2878 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases...

          Show
          Ryan Lauck added a comment - I did some performance testing of my original patch so I'd be happy to take a look at yours as well and see if I can come up with anything to help. Another idea I explored but never submitted was to build a finite state automata out of the query terms (each term is a state) and use that to highlight. Benchmarks were unstable but it seemed like a workable solution for complex queries. The real question is: does it make more sense to invest time in LUCENE-2878 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases...
          Hide
          Adrien Grand added a comment -

          The real question is: does it make more sense to invest time in LUCENE-2878 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases...

          This is my feeling too.

          Show
          Adrien Grand added a comment - The real question is: does it make more sense to invest time in LUCENE-2878 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases... This is my feeling too.
          Hide
          Simon Willnauer added a comment -

          The real question is: does it make more sense to invest time in LUCENE-2878 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases..

          +1 lets do it

          +1 to revert the change!

          Show
          Simon Willnauer added a comment - The real question is: does it make more sense to invest time in LUCENE-2878 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases.. +1 lets do it +1 to revert the change!
          Hide
          ASF subversion and git services added a comment -

          Commit 1520536 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1520536 ]

          Revert LUCENE-4734.

          Show
          ASF subversion and git services added a comment - Commit 1520536 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1520536 ] Revert LUCENE-4734 .
          Hide
          ASF subversion and git services added a comment -

          Commit 1520544 from Adrien Grand in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1520544 ]

          Revert LUCENE-4734.

          Show
          ASF subversion and git services added a comment - Commit 1520544 from Adrien Grand in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1520544 ] Revert LUCENE-4734 .
          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.

            People

            • Assignee:
              Unassigned
              Reporter:
              Ryan Lauck
            • Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:

                Development