[LUCENE-8477] Improve handling of inner disjunctions in intervals - ASF JIRA

Alan Woodward added a comment - 03/Sep/18 12:53

I can see a couple of options here:

1) Add a new operator, OR_MAX, which doesn't try to minimize its internals, and sorts prefixes last. This deals with ((a OR (a b)) BLOCK c) mentioned in the description, but it still fails to match in other situations, such as (b OR (b c)) BLOCK c - in this case because (b c) will sort before (b), so the interval will try to match (b c c). It also makes the API harder to use, as consumers now need to understand the semantics of two separate OR operators.

2) Allow IntervalsSource to rewrite itself, so that ((a OR (a b)) BLOCK c) becomes (a BLOCK c) OR ((a b) BLOCK c). This would be a lot easier on the user, but I'm not sure how easy it would be from an implementation point of view - it may end up adding lots of extra methods to IntervalsSource.
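To make the failure and the effect of option 2 concrete, here is a minimal sketch of a toy matcher. It is not Lucene code: `ToyIntervals` and its methods are hypothetical names, intervals are plain `{start, end}` position pairs, and the minimized OR is modelled simply as "drop any interval that strictly contains another one", which is an assumption about the semantics rather than the real iterator logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of minimized interval matching; illustrative only, not Lucene code. */
final class ToyIntervals {

  /** Returns {start, end} pairs where the terms occur as consecutive tokens. */
  static List<int[]> phrase(String[] doc, String... terms) {
    List<int[]> out = new ArrayList<>();
    for (int i = 0; i + terms.length <= doc.length; i++) {
      if (Arrays.equals(Arrays.copyOfRange(doc, i, i + terms.length), terms)) {
        out.add(new int[] {i, i + terms.length - 1});
      }
    }
    return out;
  }

  /** Minimized OR: drops every interval that strictly contains a shorter one. */
  static List<int[]> or(List<int[]> a, List<int[]> b) {
    List<int[]> all = new ArrayList<>(a);
    all.addAll(b);
    List<int[]> out = new ArrayList<>();
    for (int[] s : all) {
      boolean containsSmaller = false;
      for (int[] t : all) {
        boolean inside = s[0] <= t[0] && t[1] <= s[1];
        boolean shorter = t[1] - t[0] < s[1] - s[0];
        if (inside && shorter) {
          containsSmaller = true;
        }
      }
      if (!containsSmaller) {
        out.add(s);
      }
    }
    return out;
  }

  /** BLOCK: a left interval immediately followed by a right interval. */
  static List<int[]> block(List<int[]> left, List<int[]> right) {
    List<int[]> out = new ArrayList<>();
    for (int[] l : left) {
      for (int[] r : right) {
        if (r[0] == l[1] + 1) {
          out.add(new int[] {l[0], r[1]});
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String[] doc = {"a", "b", "c"};
    List<int[]> a = phrase(doc, "a");
    List<int[]> ab = phrase(doc, "a", "b");
    List<int[]> c = phrase(doc, "c");
    // Original form: ((a OR (a b)) BLOCK c)
    System.out.println(block(or(a, ab), c).size());
    // Rewritten form: (a BLOCK c) OR ((a b) BLOCK c)
    System.out.println(or(block(a, c), block(ab, c)).size());
  }
}
```

On the document "a b c" the original form finds no match in this model, because the minimized OR keeps only the interval for (a) and c does not directly follow it, while the rewritten form finds one.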

Adrien Grand added a comment - 03/Sep/18 13:53

I'd be tempted to just document this behavior for now. I'm afraid that introducing non-minimized intervals will introduce similar corner-cases to what we have with spans and sloppy phrase queries?

Rewriting automatically feels a bit wrong given that we would be replacing an IntervalsSource with another IntervalsSource that has different matches. However this is something that could be implemented on top of intervals in query parsers by having an intermediate representation of IntervalsSources and pushing disjunctions to the top?
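A rough sketch of what such a parser-side intermediate representation might look like. Everything here is hypothetical (`DisjunctionPushUp`, `Node`, `alternatives`, `render` are illustration-only names), and it assumes the tree contains only terms, BLOCKs and ORs, so every disjunction-free alternative can be flattened into a single BLOCK of terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical parser-side IR that pushes disjunctions to the top of a BLOCK tree. */
final class DisjunctionPushUp {
  enum Kind { TERM, BLOCK, OR }

  static final class Node {
    final Kind kind;
    final String text;          // only set for TERM
    final List<Node> children;  // empty for TERM

    Node(Kind kind, String text, List<Node> children) {
      this.kind = kind;
      this.text = text;
      this.children = children;
    }
  }

  static Node term(String text) { return new Node(Kind.TERM, text, new ArrayList<Node>()); }
  static Node block(Node... children) { return new Node(Kind.BLOCK, null, Arrays.asList(children)); }
  static Node or(Node... children) { return new Node(Kind.OR, null, Arrays.asList(children)); }

  /** Returns every disjunction-free alternative as a flat list of terms. */
  static List<List<String>> alternatives(Node n) {
    List<List<String>> out = new ArrayList<>();
    if (n.kind == Kind.TERM) {
      out.add(Arrays.asList(n.text));
    } else if (n.kind == Kind.OR) {
      // An OR contributes the alternatives of each child.
      for (Node child : n.children) {
        out.addAll(alternatives(child));
      }
    } else {
      // A BLOCK concatenates one alternative from each child, in order.
      out.add(new ArrayList<String>());
      for (Node child : n.children) {
        List<List<String>> next = new ArrayList<>();
        for (List<String> prefix : out) {
          for (List<String> alt : alternatives(child)) {
            List<String> joined = new ArrayList<>(prefix);
            joined.addAll(alt);
            next.add(joined);
          }
        }
        out = next;
      }
    }
    return out;
  }

  /** Renders the tree with all disjunctions pulled into one top-level OR. */
  static String render(Node n) {
    List<String> blocks = new ArrayList<>();
    for (List<String> alt : alternatives(n)) {
      blocks.add("BLOCK(" + String.join(",", alt) + ")");
    }
    return blocks.size() == 1 ? blocks.get(0) : "OR(" + String.join(",", blocks) + ")";
  }
}
```

For example, `render(block(or(term("a"), block(term("a"), term("b"))), term("c")))` yields `OR(BLOCK(a,c),BLOCK(a,b,c))`. The number of alternatives grows multiplicatively with nested disjunctions, which is the efficiency cost a parser would have to weigh.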

Alan Woodward added a comment - 05/Sep/18 13:01

I like jpountz's suggestion just to document this. Parser implementers can check for disjunctions with variable lengths and push those up the tree.

Martin Hermann or goller@detego-software.de would this work for you?

Alan Woodward added a comment - 15/Mar/19 17:28

Here is a proposal to fix this, using the new QueryVisitor API to work out if disjunctions have any sub-clauses with common first terms. Given an interval BLOCK(a,or(BLOCK(b,c),b),d) we can ensure that all matches are collected by rewriting things so that the final clause d is moved inside the disjunction, yielding BLOCK(a,or(BLOCK(b,c,d),BLOCK(b,d))). Checking for common prefixes means that intervals of the form BLOCK(a,or(BLOCK(b,c),d),e) don't need to be rewritten, which will be more efficient when the query is run as we only need to iterate positions for the final term once.
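The decision described above can be sketched roughly as follows. This is not the patch itself: `CommonPrefixCheck` and its methods are made-up names, sources are reduced to lists of term strings, and the output is just the textual form of the resulting interval.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: rewrite a disjunction only if alternatives share a first term. */
final class CommonPrefixCheck {

  /** Builds alternatives from comma-separated term lists, e.g. alts("b,c", "b"). */
  static List<List<String>> alts(String... phrases) {
    List<List<String>> out = new ArrayList<>();
    for (String p : phrases) {
      out.add(Arrays.asList(p.split(",")));
    }
    return out;
  }

  /** True if two or more alternatives start with the same term. */
  static boolean sharesFirstTerm(List<List<String>> alternatives) {
    Set<String> seen = new HashSet<>();
    for (List<String> alt : alternatives) {
      if (!alt.isEmpty() && !seen.add(alt.get(0))) {
        return true;
      }
    }
    return false;
  }

  /** Renders BLOCK(before, or(alternatives), after), moving 'after' inside the or() when needed. */
  static String rewrite(String before, List<List<String>> alternatives, String after) {
    boolean move = sharesFirstTerm(alternatives);
    List<String> parts = new ArrayList<>();
    for (List<String> alt : alternatives) {
      List<String> terms = new ArrayList<>(alt);
      if (move) {
        terms.add(after);  // the final clause is duplicated into each alternative
      }
      parts.add(terms.size() == 1 ? terms.get(0) : "BLOCK(" + String.join(",", terms) + ")");
    }
    String disjunction = "or(" + String.join(",", parts) + ")";
    return move
        ? "BLOCK(" + before + "," + disjunction + ")"
        : "BLOCK(" + before + "," + disjunction + "," + after + ")";
  }
}
```

With these assumptions, `rewrite("a", alts("b,c", "b"), "d")` produces `BLOCK(a,or(BLOCK(b,c,d),BLOCK(b,d)))`, while `rewrite("a", alts("b,c", "d"), "e")` leaves the interval as `BLOCK(a,or(BLOCK(b,c),d),e)`.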

Alan Woodward added a comment - 15/Mar/19 20:36

Here's a better patch, using term counting rather than prefix matching - the latter won't work if we have stacked tokens, for example, and this makes things much simpler.

Alan Woodward added a comment - 17/Mar/19 14:58

Another iteration. Instead of using term counting, we now compare minExtent(); anything greater than 1 is a candidate for rewriting, because it can lead to different-length overlaps in the disjunction due to term stacking, and minExtent() also takes into account things like Intervals.extend() which may only have a single term but can have a length of 2 or more. It also adds a new method to IntervalsSource, getDisjunctions(), which allows this rewriting to work even for filtered intervals like WITHIN or CONTAINING.
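As a rough illustration of the minExtent() idea, here is a self-contained sketch. The `Source` interface and the factory methods below are stand-ins, not the Lucene classes, and the rule in `rewriteCandidate` (flag a disjunction when any alternative has a minimum extent above 1) is one reading of the comment rather than a confirmed description of the patch.

```java
/** Hypothetical stand-in for the minExtent() check; not the Lucene implementation. */
final class MinExtentSketch {

  /** Minimal stand-in for a source that can report its smallest possible width. */
  interface Source {
    int minExtent();
  }

  /** A single term always covers exactly one position. */
  static Source term(String text) {
    return () -> 1;
  }

  /** Models an extended source: the wrapped source is widened on both sides. */
  static Source extend(Source in, int before, int after) {
    return () -> in.minExtent() + before + after;
  }

  /** A block of adjacent sub-sources is at least as wide as the sum of its parts. */
  static Source block(Source... subs) {
    return () -> {
      int total = 0;
      for (Source s : subs) {
        total += s.minExtent();
      }
      return total;
    };
  }

  /** Assumed rule: a disjunction is a rewrite candidate if any alternative can span more than one position. */
  static boolean rewriteCandidate(Source... alternatives) {
    for (Source s : alternatives) {
      if (s.minExtent() > 1) {
        return true;
      }
    }
    return false;
  }
}
```

Under these assumptions `rewriteCandidate(term("b"), term("c"))` is false, `rewriteCandidate(block(term("b"), term("c")), term("b"))` is true, and a single term wrapped as `extend(term("b"), 0, 1)` also counts because its minimum extent is 2.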

Alan Woodward added a comment - 20/Mar/19 17:02

New patch, fixes a bug with how MAXGAPS and MAXWIDTH filters were dealing with inner disjunctions.

Jim Ferenczi added a comment - 20/Mar/19 22:24

The patch fixes disjunctions that share a common prefix, but the same problem can arise for disjunctions that share suffixes. For instance the query or(york, BLOCK(new, york)) has the same minimum-interval semantics as "york". So a query like BLOCK(in, or(york, BLOCK(new, york))) will not match "in new york" because "new york" is discarded in favour of the minimum interval "york". We could apply the same logic and rewrite the query automatically, but I am sure we can find other pathological cases due to minimum-interval semantics. IMO we should document this unintuitive behavior rather than rewriting all queries in a non-optimal form.

Alan Woodward added a comment - 25/Mar/19 13:47

I've opened a PR to make discussing this easier, as it's grown to a fairly big change (although the public API is pretty much the same): https://github.com/apache/lucene-solr/pull/620

I agree that rewriting queries can be sub-optimal, but I think we still need to make it possible to get accurate hits, which is currently difficult to do at construction time because disjunctions can end up being wrapped multiple times, and the implementing classes are all package-private so you can't just use instanceof checks.

My suggestion is that we automatically rewrite things to match accurately, but add a flag to Intervals.or() that allows you to opt out of the rewriting if you want speed above accuracy, or if you know that the members of a disjunction won't overlap (for example if you have no synonyms and so know that there are no stacked tokens).

ASF subversion and git services added a comment - 27/Mar/19 11:23

Commit f1782d0dd1195c823c79aab87529ebd7e8217b95 in lucene-solr's branch refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f1782d0 ]

LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter (#620)

We have a number of IntervalsSource implementations where automatic minimization of
disjunctions can lead to surprising results:

- PHRASE queries can miss matches because a longer matching sub-source is minimized away, leaving a gap
- MAXGAPS queries can miss matches for the same reason
- CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries can miss matches if the 'big' interval gets minimized

The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to
add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting
for each source that it would apply to.

Because these rewritten queries will in general be less efficient due to the duplication of
effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
original), we also add an option to Intervals.or() that will prevent this happening, so that
consumers can choose speed over accuracy if it suits their usecase.
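The iterator counts quoted in the commit message can be checked with a small sketch. `IteratorCount` and its helpers are hypothetical names, and the model assumes one term iterator per leaf term in the query tree.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: counts leaf terms as a proxy for term iterators pulled. */
final class IteratorCount {

  static final class Node {
    final String term;          // non-null for leaves only
    final List<Node> children;  // empty for leaves

    Node(String term, Node... children) {
      this.term = term;
      this.children = Arrays.asList(children);
    }
  }

  static Node term(String text) { return new Node(text); }
  static Node phrase(Node... children) { return new Node(null, children); }
  static Node or(Node... children) { return new Node(null, children); }

  /** One iterator per leaf term; composite nodes sum their children. */
  static int termIterators(Node n) {
    if (n.term != null) {
      return 1;
    }
    int total = 0;
    for (Node child : n.children) {
      total += termIterators(child);
    }
    return total;
  }

  public static void main(String[] args) {
    // PHRASE("a", OR(PHRASE("b", "c"), "c"))
    Node original = phrase(term("a"), or(phrase(term("b"), term("c")), term("c")));
    // OR(PHRASE("a", "b", "c"), PHRASE("a", "c"))
    Node rewritten = or(phrase(term("a"), term("b"), term("c")), phrase(term("a"), term("c")));
    System.out.println(termIterators(original) + " vs " + termIterators(rewritten));
  }
}
```

In this model the original tree has 4 leaves and the rewritten one has 5, matching the figures in the commit message; the extra leaf is the duplicated "a".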

ASF subversion and git services added a comment - 27/Mar/19 11:25

Commit 2571bf355f82e193d50ec2509f83ba698b262562 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2571bf3 ]

LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter (#620)

We have a number of IntervalsSource implementations where automatic minimization of
disjunctions can lead to surprising results:

- PHRASE queries can miss matches because a longer matching sub-source is minimized away, leaving a gap
- MAXGAPS queries can miss matches for the same reason
- CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries can miss matches if the 'big' interval gets minimized

The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to
add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting
for each source that it would apply to.

Because these rewritten queries will in general be less efficient due to the duplication of
effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
original), we also add an option to Intervals.or() that will prevent this happening, so that
consumers can choose speed over accuracy if it suits their usecase.

Alan Woodward added a comment - 27/Mar/19 11:26

Thanks for the reviews jim.ferenczi!

ASF subversion and git services added a comment - 27/Mar/19 11:29

Commit 0bb9b95ac7397e3c2d78b51cb49f7075ea1574e5 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0bb9b95 ]

LUCENE-8477: Add CHANGES entry

ASF subversion and git services added a comment - 27/Mar/19 11:29

Commit 3a63c58db3074d4a0a2dbf4cf3147f6d6cdf73ca in lucene-solr's branch refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3a63c58 ]

LUCENE-8477: Add CHANGES entry

ASF subversion and git services added a comment - 08/Apr/19 11:29

Commit c1222b57e940f108cb3f5b8f720a910a5fb35126 in lucene-solr's branch refs/heads/master from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c1222b5 ]

LUCENE-8477: Restore public ctr for FilteredIntervalsSource

ASF subversion and git services added a comment - 08/Apr/19 11:30

Commit e460356abeb1bd075a885d905a1d0873469bbd43 in lucene-solr's branch refs/heads/branch_8x from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e460356 ]

LUCENE-8477: Restore public ctr for FilteredIntervalsSource

ASF subversion and git services added a comment - 09/Apr/19 12:49

Commit c1222b57e940f108cb3f5b8f720a910a5fb35126 in lucene-solr's branch refs/heads/jira/LUCENE-8738 from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c1222b5 ]

LUCENE-8477: Restore public ctr for FilteredIntervalsSource

Tomoko Uchida added a comment - 28/Aug/22 15:35

This issue was moved to GitHub issue: #9523.
