Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8477

Improve handling of inner disjunctions in intervals

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 8.1
    • None
    • None
    • New

    Description

      The current implementation of the disjunction interval produced by Intervals.or is a direct implementation of the OR operator from the Vigna paper.  This produces minimal intervals, meaning that (a) is preferred over (a b), and (b) also over (a b).  This has advantages when it comes to counting intervals for scoring, but also has drawbacks when it comes to matching.  For example, a phrase query for ((a OR (a b)) BLOCK (c)) will not match the document (a b c), because (a) will be preferred over (a b), and (a c) does not match.

      This ticket is to discuss the best way of dealing with disjunctions.

      Attachments

        1. LUCENE-8477.patch
          45 kB
          Alan Woodward
        2. LUCENE-8477.patch
          40 kB
          Alan Woodward
        3. LUCENE-8477.patch
          16 kB
          Alan Woodward
        4. LUCENE-8477.patch
          19 kB
          Alan Woodward

        Issue Links

          Activity

            romseygeek Alan Woodward added a comment -

            I can see a couple of options here:

            1) Add a new operator, OR_MAX, which doesn't try to minimize its internals, and sorts prefixes last.  This deals with ((a OR (a b)) BLOCK c) mentioned in the description, but it still fails to match in other situations, such as (b OR (b c)) BLOCK c - in this case because (b c) will sort before (b), so the interval will try to match (b c c).  It also makes it less easy to use, as consumers now need to understand the semantics of two separate OR operators

            2) Allow IntervalsSource to rewrite itself, so that ((a OR (a b)) BLOCK c) becomes (a BLOCK c) OR ((a b) BLOCK c).  This would be a lot easier on the user, but I'm not sure how easy it would be from an implementation point of view - it may end up adding lots of extra methods to IntervalsSource.

            romseygeek Alan Woodward added a comment - I can see a couple of options here: 1) Add a new operator, OR_MAX, which doesn't try to minimize its internals, and sorts prefixes last.  This deals with ((a OR (a b)) BLOCK c) mentioned in the description, but it still fails to match in other situations, such as (b OR (b c)) BLOCK c - in this case because (b c) will sort before (b), so the interval will try to match (b c c).  It also makes it less easy to use, as consumers now need to understand the semantics of two separate OR operators 2) Allow IntervalsSource to rewrite itself, so that ((a OR (a b)) BLOCK c) becomes (a BLOCK c) OR ((a b) BLOCK c).  This would be a lot easier on the user, but I'm not sure how easy it would be from an implementation point of view - it may end up adding lots of extra methods to IntervalsSource.
            jpountz Adrien Grand added a comment -

            I'd be tempted to just document this behavior for now. I'm afraid that introducing non-minimized intervals will introduce similar corner-cases to what we have with spans and sloppy phrase queries?

            Rewriting automatically feels a bit wrong given that we would be replacing an IntervalsSource with another IntervalsSource that has different matches. However this is something that could be implemented on top of intervals in query parsers by having an intermediate representation of IntervalsSources and push disjunctions to the top?

            jpountz Adrien Grand added a comment - I'd be tempted to just document this behavior for now. I'm afraid that introducing non-minimized intervals will introduce similar corner-cases to what we have with spans and sloppy phrase queries? Rewriting automatically feels a bit wrong given that we would be replacing an IntervalsSource with another IntervalsSource that has different matches. However this is something that could be implemented on top of intervals in query parsers by having an intermediate representation of IntervalsSources and push disjunctions to the top?
            romseygeek Alan Woodward added a comment -

            I like jpountz's suggestion just to document this.  Parser implementers can check for disjunctions with variable lengths and push those up the tree.

            Martin Hermann or goller@detego-software.de would this work for you?

            romseygeek Alan Woodward added a comment - I like jpountz 's suggestion just to document this.  Parser implementers can check for disjunctions with variable lengths and push those up the tree. Martin Hermann or goller@detego-software.de would this work for you?
            romseygeek Alan Woodward added a comment -

            Here is a proposal to fix this, using the new QueryVisitor API to work out if disjunctions have any sub-clauses with common first terms.  Given an interval BLOCK(a,or(BLOCK(b,c),b),d) we can ensure that all matches are collected by rewriting things so that the final clause d is moved inside the disjunction, yielding BLOCK(a,or(BLOCK(b,c,d),BLOCK(b,d))).  Checking for common prefixes means that intervals of the form BLOCK(a,or(BLOCK(b,c),d),e) don't need to be rewritten, which will be more efficient when the query is run as we only need to iterate positions for the final term once.

            romseygeek Alan Woodward added a comment - Here is a proposal to fix this, using the new QueryVisitor API to work out if disjunctions have any sub-clauses with common first terms.  Given an interval BLOCK(a,or(BLOCK(b,c),b),d) we can ensure that all matches are collected by rewriting things so that the final clause  d is moved inside the disjunction, yielding BLOCK(a,or(BLOCK(b,c,d),BLOCK(b,d))) .  Checking for common prefixes means that intervals of the form BLOCK(a,or(BLOCK(b,c),d),e) don't need to be rewritten, which will be more efficient when the query is run as we only need to iterate positions for the final term once.
            romseygeek Alan Woodward added a comment -

            Here's a better patch, using term counting rather than prefix matching - the latter won't work if we have stacked tokens, for example, and this makes things much simpler.

            romseygeek Alan Woodward added a comment - Here's a better patch, using term counting rather than prefix matching - the latter won't work if we have stacked tokens, for example, and this makes things much simpler.
            romseygeek Alan Woodward added a comment -

            Another iteration. Instead of using term counting, we now compare minExtent(); anything greater than 1 is a candidate for rewriting, because it can lead to different-length overlaps in the disjunction due to term stacking, and minExtent() also takes into account things like Intervals.extend() which may only have a single term but can have a length of 2 or more. It also adds a new method to IntervalsSource, getDisjunctions(), which allows this rewriting to work even for filtered intervals like WITHIN or CONTAINING.

            romseygeek Alan Woodward added a comment - Another iteration. Instead of using term counting, we now compare minExtent(); anything greater than 1 is a candidate for rewriting, because it can lead to different-length overlaps in the disjunction due to term stacking, and minExtent() also takes into account things like Intervals.extend() which may only have a single term but can have a length of 2 or more. It also adds a new method to IntervalsSource, getDisjunctions(), which allows this rewriting to work even for filtered intervals like WITHIN or CONTAINING.
            romseygeek Alan Woodward added a comment -

            New patch, fixes a bug with how MAXGAPS and MAXWIDTH filters were dealing with inner disjunctions.

            romseygeek Alan Woodward added a comment - New patch, fixes a bug with how MAXGAPS and MAXWIDTH filters were dealing with inner disjunctions.
            jim.ferenczi Jim Ferenczi added a comment -

            The patch fixes disjunctions that share a common prefix but the same problem can arise for disjunctions that share suffixes. For instance the query or(york, BLOCK(new, york)) has the same minimum interval semantic than "york". So a query like BLOCK(in, or(york, BLOCK(new, york))) will not match "in new york" because "new york" is discarded by the minimum interval "york". We could apply the same logic and rewrite the query automatically but I am sure we can find other pathological cases due to minimum interval semantics. IMO we should document this unintuitive behavior rather than rewriting all queries in a non-optimal form.

            jim.ferenczi Jim Ferenczi added a comment - The patch fixes disjunctions that share a common prefix but the same problem can arise for disjunctions that share suffixes. For instance the query or(york, BLOCK(new, york)) has the same minimum interval semantic than "york". So a query like BLOCK(in, or(york, BLOCK(new, york))) will not match "in new york" because "new york" is discarded by the minimum interval "york". We could apply the same logic and rewrite the query automatically but I am sure we can find other pathological cases due to minimum interval semantics. IMO we should document this unintuitive behavior rather than rewriting all queries in a non-optimal form.
            romseygeek Alan Woodward added a comment -

            I've opened a PR to make discussing this easier, as it's grown to a fairly big change (although the public API is pretty much the same): https://github.com/apache/lucene-solr/pull/620

            I agree that rewriting queries can be sub-optimal, but I think we still need to make it possible to get accurate hits, which
            is currently difficult to do at construction time because disjunctions can end up being wrapped multiple times, and the implementing classes are all package-private so you can't just use instanceof checks.

            My suggestion is that we automatically rewrite things to match accurately, but add a flag to Intervals.or() that allows you to opt out of the rewriting if you want speed above accuracy, or if you know that the members of a disjunction won't overlap (for example if you have no synonyms and so know that there are no stacked tokens).

            romseygeek Alan Woodward added a comment - I've opened a PR to make discussing this easier, as it's grown to a fairly big change (although the public API is pretty much the same): https://github.com/apache/lucene-solr/pull/620 I agree that rewriting queries can be sub-optimal, but I think we still need to make it possible to get accurate hits, which is currently difficult to do at construction time because disjunctions can end up being wrapped multiple times, and the implementing classes are all package-private so you can't just use instanceof checks. My suggestion is that we automatically rewrite things to match accurately, but add a flag to Intervals.or() that allows you to opt out of the rewriting if you want speed above accuracy, or if you know that the members of a disjunction won't overlap (for example if you have no synonyms and so know that there are no stacked tokens).

            Commit f1782d0dd1195c823c79aab87529ebd7e8217b95 in lucene-solr's branch refs/heads/master from Alan Woodward
            [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f1782d0 ]

            LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter (#620)

            We have a number of IntervalsSource implementations where automatic minimization of
            disjunctions can lead to surprising results:

            • PHRASE queries can miss matches because a longer matching sub-source is minimized
              away, leaving a gap
            • MAXGAPS queries can miss matches for the same reason
            • CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries
              can miss matches if the 'big' interval gets minimized

            The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
            of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
            OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to
            add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting
            for each source that it would apply to.

            Because these rewritten queries will in general be less efficient due to the duplication of
            effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
            original), we also add an option to Intervals.or() that will prevent this happening, so that
            consumers can choose speed over accuracy if it suits their usecase.

            jira-bot ASF subversion and git services added a comment - Commit f1782d0dd1195c823c79aab87529ebd7e8217b95 in lucene-solr's branch refs/heads/master from Alan Woodward [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f1782d0 ] LUCENE-8477 : Automatically rewrite disjunctions when internal gaps matter (#620) We have a number of IntervalsSource implementations where automatic minimization of disjunctions can lead to surprising results: PHRASE queries can miss matches because a longer matching sub-source is minimized away, leaving a gap MAXGAPS queries can miss matches for the same reason CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries can miss matches if the 'big' interval gets minimized The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting for each source that it would apply to. Because these rewritten queries will in general be less efficient due to the duplication of effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the original), we also add an option to Intervals.or() that will prevent this happening, so that consumers can choose speed over accuracy if it suits their usecase.

            Commit 2571bf355f82e193d50ec2509f83ba698b262562 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward
            [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2571bf3 ]

            LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter (#620)

            We have a number of IntervalsSource implementations where automatic minimization of
            disjunctions can lead to surprising results:

            • PHRASE queries can miss matches because a longer matching sub-source is minimized
              away, leaving a gap
            • MAXGAPS queries can miss matches for the same reason
            • CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries
              can miss matches if the 'big' interval gets minimized

            The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
            of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
            OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to
            add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting
            for each source that it would apply to.

            Because these rewritten queries will in general be less efficient due to the duplication of
            effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
            original), we also add an option to Intervals.or() that will prevent this happening, so that
            consumers can choose speed over accuracy if it suits their usecase.

            jira-bot ASF subversion and git services added a comment - Commit 2571bf355f82e193d50ec2509f83ba698b262562 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2571bf3 ] LUCENE-8477 : Automatically rewrite disjunctions when internal gaps matter (#620) We have a number of IntervalsSource implementations where automatic minimization of disjunctions can lead to surprising results: PHRASE queries can miss matches because a longer matching sub-source is minimized away, leaving a gap MAXGAPS queries can miss matches for the same reason CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries can miss matches if the 'big' interval gets minimized The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting for each source that it would apply to. Because these rewritten queries will in general be less efficient due to the duplication of effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the original), we also add an option to Intervals.or() that will prevent this happening, so that consumers can choose speed over accuracy if it suits their usecase.
            romseygeek Alan Woodward added a comment -

            Thanks for the reviews jim.ferenczi!

            romseygeek Alan Woodward added a comment - Thanks for the reviews jim.ferenczi !

            Commit 0bb9b95ac7397e3c2d78b51cb49f7075ea1574e5 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward
            [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0bb9b95 ]

            LUCENE-8477: Add CHANGES entry

            jira-bot ASF subversion and git services added a comment - Commit 0bb9b95ac7397e3c2d78b51cb49f7075ea1574e5 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0bb9b95 ] LUCENE-8477 : Add CHANGES entry

            Commit 3a63c58db3074d4a0a2dbf4cf3147f6d6cdf73ca in lucene-solr's branch refs/heads/master from Alan Woodward
            [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3a63c58 ]

            LUCENE-8477: Add CHANGES entry

            jira-bot ASF subversion and git services added a comment - Commit 3a63c58db3074d4a0a2dbf4cf3147f6d6cdf73ca in lucene-solr's branch refs/heads/master from Alan Woodward [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3a63c58 ] LUCENE-8477 : Add CHANGES entry

            Commit c1222b57e940f108cb3f5b8f720a910a5fb35126 in lucene-solr's branch refs/heads/master from Jim Ferenczi
            [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c1222b5 ]

            LUCENE-8477: Restore public ctr for FilteredIntervalsSource

            jira-bot ASF subversion and git services added a comment - Commit c1222b57e940f108cb3f5b8f720a910a5fb35126 in lucene-solr's branch refs/heads/master from Jim Ferenczi [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c1222b5 ] LUCENE-8477 : Restore public ctr for FilteredIntervalsSource

            Commit e460356abeb1bd075a885d905a1d0873469bbd43 in lucene-solr's branch refs/heads/branch_8x from Jim Ferenczi
            [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e460356 ]

            LUCENE-8477: Restore public ctr for FilteredIntervalsSource

            jira-bot ASF subversion and git services added a comment - Commit e460356abeb1bd075a885d905a1d0873469bbd43 in lucene-solr's branch refs/heads/branch_8x from Jim Ferenczi [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e460356 ] LUCENE-8477 : Restore public ctr for FilteredIntervalsSource

            Commit c1222b57e940f108cb3f5b8f720a910a5fb35126 in lucene-solr's branch refs/heads/jira/LUCENE-8738 from Jim Ferenczi
            [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c1222b5 ]

            LUCENE-8477: Restore public ctr for FilteredIntervalsSource

            jira-bot ASF subversion and git services added a comment - Commit c1222b57e940f108cb3f5b8f720a910a5fb35126 in lucene-solr's branch refs/heads/jira/ LUCENE-8738 from Jim Ferenczi [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c1222b5 ] LUCENE-8477 : Restore public ctr for FilteredIntervalsSource
            tomoko Tomoko Uchida added a comment -

            This issue was moved to GitHub issue: #9523.

            tomoko Tomoko Uchida added a comment - This issue was moved to GitHub issue: #9523 .

            People

              romseygeek Alan Woodward
              romseygeek Alan Woodward
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m