Details
-
New Feature
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
-
New
Description
The current implementation of the disjunction interval produced by Intervals.or is a direct implementation of the OR operator from the Vigna paper. This produces minimal intervals, meaning that (a) is preferred over (a b), and (b) also over (a b). This has advantages when it comes to counting intervals for scoring, but also has drawbacks when it comes to matching. For example, a phrase query for ((a OR (a b)) BLOCK (c)) will not match the document (a b c), because (a) will be preferred over (a b), and (a c) does not match.
This ticket is to discuss the best way of dealing with disjunctions.
Attachments
Attachments
- LUCENE-8477.patch
- 45 kB
- Alan Woodward
- LUCENE-8477.patch
- 40 kB
- Alan Woodward
- LUCENE-8477.patch
- 16 kB
- Alan Woodward
- LUCENE-8477.patch
- 19 kB
- Alan Woodward
Issue Links
- links to
Activity
I'd be tempted to just document this behavior for now. I'm afraid that introducing non-minimized intervals will introduce similar corner-cases to what we have with spans and sloppy phrase queries?
Rewriting automatically feels a bit wrong given that we would be replacing an IntervalsSource with another IntervalsSource that has different matches. However this is something that could be implemented on top of intervals in query parsers by having an intermediate representation of IntervalsSources and push disjunctions to the top?
I like jpountz's suggestion just to document this. Parser implementers can check for disjunctions with variable lengths and push those up the tree.
Martin Hermann or goller@detego-software.de would this work for you?
Here is a proposal to fix this, using the new QueryVisitor API to work out if disjunctions have any sub-clauses with common first terms. Given an interval BLOCK(a,or(BLOCK(b,c),b),d) we can ensure that all matches are collected by rewriting things so that the final clause d is moved inside the disjunction, yielding BLOCK(a,or(BLOCK(b,c,d),BLOCK(b,d))). Checking for common prefixes means that intervals of the form BLOCK(a,or(BLOCK(b,c),d),e) don't need to be rewritten, which will be more efficient when the query is run as we only need to iterate positions for the final term once.
Here's a better patch, using term counting rather than prefix matching - the latter won't work if we have stacked tokens, for example, and this makes things much simpler.
Another iteration. Instead of using term counting, we now compare minExtent(); anything greater than 1 is a candidate for rewriting, because it can lead to different-length overlaps in the disjunction due to term stacking, and minExtent() also takes into account things like Intervals.extend() which may only have a single term but can have a length of 2 or more. It also adds a new method to IntervalsSource, getDisjunctions(), which allows this rewriting to work even for filtered intervals like WITHIN or CONTAINING.
New patch, fixes a bug with how MAXGAPS and MAXWIDTH filters were dealing with inner disjunctions.
The patch fixes disjunctions that share a common prefix but the same problem can arise for disjunctions that share suffixes. For instance the query or(york, BLOCK(new, york)) has the same minimum interval semantic than "york". So a query like BLOCK(in, or(york, BLOCK(new, york))) will not match "in new york" because "new york" is discarded by the minimum interval "york". We could apply the same logic and rewrite the query automatically but I am sure we can find other pathological cases due to minimum interval semantics. IMO we should document this unintuitive behavior rather than rewriting all queries in a non-optimal form.
I've opened a PR to make discussing this easier, as it's grown to a fairly big change (although the public API is pretty much the same): https://github.com/apache/lucene-solr/pull/620
I agree that rewriting queries can be sub-optimal, but I think we still need to make it possible to get accurate hits, which
is currently difficult to do at construction time because disjunctions can end up being wrapped multiple times, and the implementing classes are all package-private so you can't just use instanceof checks.
My suggestion is that we automatically rewrite things to match accurately, but add a flag to Intervals.or() that allows you to opt out of the rewriting if you want speed above accuracy, or if you know that the members of a disjunction won't overlap (for example if you have no synonyms and so know that there are no stacked tokens).
Commit f1782d0dd1195c823c79aab87529ebd7e8217b95 in lucene-solr's branch refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f1782d0 ]
LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter (#620)
We have a number of IntervalsSource implementations where automatic minimization of
disjunctions can lead to surprising results:
- PHRASE queries can miss matches because a longer matching sub-source is minimized
away, leaving a gap - MAXGAPS queries can miss matches for the same reason
- CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries
can miss matches if the 'big' interval gets minimized
The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to
add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting
for each source that it would apply to.
Because these rewritten queries will in general be less efficient due to the duplication of
effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
original), we also add an option to Intervals.or() that will prevent this happening, so that
consumers can choose speed over accuracy if it suits their usecase.
Commit 2571bf355f82e193d50ec2509f83ba698b262562 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2571bf3 ]
LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter (#620)
We have a number of IntervalsSource implementations where automatic minimization of
disjunctions can lead to surprising results:
- PHRASE queries can miss matches because a longer matching sub-source is minimized
away, leaving a gap - MAXGAPS queries can miss matches for the same reason
- CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries
can miss matches if the 'big' interval gets minimized
The proper way to deal with this is to rewrite the queries by pulling disjunctions to the top
of the query tree, so that PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally, we need to
add a new pullUpDisjunctions() method to IntervalsSource that performs this rewriting
for each source that it would apply to.
Because these rewritten queries will in general be less efficient due to the duplication of
effort (eg the rewritten PHRASE query above pulls 5 term iterators rather than 4 in the
original), we also add an option to Intervals.or() that will prevent this happening, so that
consumers can choose speed over accuracy if it suits their usecase.
Commit 0bb9b95ac7397e3c2d78b51cb49f7075ea1574e5 in lucene-solr's branch refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0bb9b95 ]
LUCENE-8477: Add CHANGES entry
Commit 3a63c58db3074d4a0a2dbf4cf3147f6d6cdf73ca in lucene-solr's branch refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3a63c58 ]
LUCENE-8477: Add CHANGES entry
Commit c1222b57e940f108cb3f5b8f720a910a5fb35126 in lucene-solr's branch refs/heads/master from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c1222b5 ]
LUCENE-8477: Restore public ctr for FilteredIntervalsSource
Commit e460356abeb1bd075a885d905a1d0873469bbd43 in lucene-solr's branch refs/heads/branch_8x from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e460356 ]
LUCENE-8477: Restore public ctr for FilteredIntervalsSource
Commit c1222b57e940f108cb3f5b8f720a910a5fb35126 in lucene-solr's branch refs/heads/jira/LUCENE-8738 from Jim Ferenczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c1222b5 ]
LUCENE-8477: Restore public ctr for FilteredIntervalsSource
I can see a couple of options here:
1) Add a new operator, OR_MAX, which doesn't try to minimize its internals, and sorts prefixes last. This deals with ((a OR (a b)) BLOCK c) mentioned in the description, but it still fails to match in other situations, such as (b OR (b c)) BLOCK c - in this case because (b c) will sort before (b), so the interval will try to match (b c c). It also makes it less easy to use, as consumers now need to understand the semantics of two separate OR operators
2) Allow IntervalsSource to rewrite itself, so that ((a OR (a b)) BLOCK c) becomes (a BLOCK c) OR ((a b) BLOCK c). This would be a lot easier on the user, but I'm not sure how easy it would be from an implementation point of view - it may end up adding lots of extra methods to IntervalsSource.