[LUCENE-6513] Allow limits on SpanMultiTermQueryWrapper expansion - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Lucene Fields:

New

Description

SpanMultiTermQueryWrapper currently rewrites to a SpanOrQuery with as many clauses as there are matching terms. It would be nice to be able to limit this in a slightly nicer way than using TopTerms, which for most queries just translates to a lexicographical ordering.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

LUCENE-6513.patch
01/Jun/15 10:50
12 kB
Alan Woodward
LUCENE-6513.patch
28/Jul/15 15:12
12 kB
Alan Woodward
LUCENE-6513.patch
05/Jul/17 14:59
10 kB
Alan Woodward
LUCENE-6513.patch
06/Jul/17 22:47
13 kB
Nishant Mehta
LUCENE-6513.patch
05/Jul/18 22:27
16 kB
David Smiley

Issue Links

is related to

LUCENE-8983 PhraseWildcardQuery - new query to control and optimize wildcard expansions in phrase

Closed

relates to

LUCENE-8284 Add MultiTermsIntervalsSource

Patch Available

Activity

Ascending order - Click to sort in descending order

Alan Woodward added a comment - 01/Jun/15 10:50

Patch. This adds a FrequentTerms rewrite to SpanMTQWrapper, which will rewrite the query using the terms with highest term frequencies.

Alan Woodward added a comment - 01/Jun/15 10:50 Patch. This adds a FrequentTerms rewrite to SpanMTQWrapper, which will rewrite the query using the terms with highest term frequencies.

Robert Muir added a comment - 01/Jun/15 11:44

Hmm, ideally we don't need to implement this as SpanOrQuery?

In other words, I think the rewrite method should be TopTerms with a different comparison function...

Robert Muir added a comment - 01/Jun/15 11:44 Hmm, ideally we don't need to implement this as SpanOrQuery? In other words, I think the rewrite method should be TopTerms with a different comparison function...

Alan Woodward added a comment - 01/Jun/15 12:27

The problem with using TopTerms is that it requires that terms have the same comparison value across segments (which makes it more efficient, because it can only build TermContexts for candidate terms), but that doesn't hold for termFreq.

Alan Woodward added a comment - 01/Jun/15 12:27 The problem with using TopTerms is that it requires that terms have the same comparison value across segments (which makes it more efficient, because it can only build TermContexts for candidate terms), but that doesn't hold for termFreq.

Robert Muir added a comment - 01/Jun/15 12:44

Can we solve this somehow in a better way? thetaphi any ideas?

Robert Muir added a comment - 01/Jun/15 12:44 Can we solve this somehow in a better way? thetaphi any ideas?

Uwe Schindler added a comment - 01/Jun/15 15:33

Not yet, I have to discuss with Alan in person. I don't yet undertand the problem of using another rewrite. It is better what we currently have in any case - if you are happy with the fact that not all terms are used in the query.

Uwe Schindler added a comment - 01/Jun/15 15:33 Not yet, I have to discuss with Alan in person. I don't yet undertand the problem of using another rewrite. It is better what we currently have in any case - if you are happy with the fact that not all terms are used in the query.

Alan Woodward added a comment - 28/Jul/15 15:12

Updated to trunk. I don't think there's any easier way to solve this (the other SpanRewriteMethods end up building SpanOrQueries as well)

Alan Woodward added a comment - 28/Jul/15 15:12 Updated to trunk. I don't think there's any easier way to solve this (the other SpanRewriteMethods end up building SpanOrQueries as well)

Alan Woodward added a comment - 05/Jul/17 15:04

Waking this one up again, here's a different attempt at limiting MTQ expansion for span queries. We can't easily use TopTermsRewrite to select the most-frequent terms, because term frequency is an index-wide stat, and TopTerms only sees info from a single segment at a time. Instead, I'm using MultiTermsEnum to iterate through the matching terms, and recording the most frequent in a priority queue.

The default is for SpanMTQWrapper to have no limits, but maybe we should change this to use the max boolean clause limit?

Alan Woodward added a comment - 05/Jul/17 15:04 Waking this one up again, here's a different attempt at limiting MTQ expansion for span queries. We can't easily use TopTermsRewrite to select the most-frequent terms, because term frequency is an index-wide stat, and TopTerms only sees info from a single segment at a time. Instead, I'm using MultiTermsEnum to iterate through the matching terms, and recording the most frequent in a priority queue. The default is for SpanMTQWrapper to have no limits, but maybe we should change this to use the max boolean clause limit?

Timothy M. Rodriguez added a comment - 05/Jul/17 23:16

romseygeek we've written a patch to solve this problem as well we've been meaning to share with the community. It goes about the solution in a bit of a different way. We'll try to get it up here in a day or two, though I'm not sure which approach will be preferable.

Timothy M. Rodriguez added a comment - 05/Jul/17 23:16 romseygeek we've written a patch to solve this problem as well we've been meaning to share with the community. It goes about the solution in a bit of a different way. We'll try to get it up here in a day or two, though I'm not sure which approach will be preferable.

Nishant Mehta added a comment - 06/Jul/17 22:46

Here's another approach we tried for limiting the MTQ expansion of span queries. FrequentTermsScoringRewrite collects terms using the ParallelArraysTermCollector in ScoringRewrite which is then added to the PriorityQueue in FrequentTermsScoringRewrite, overflows in the PriorityQueue are handled using the predicate supplied to FrequentTermsScoringRewrite. The default is DF_ORDER, but there's also a DF_THEN_TTF_ORDER provided for ties, and others could be added if needed. Then SpanQueryFrequentTermsScoringRewrite simply builds a SpanOrQuery out of the collected terms. romseygeek the test is inspired from your earlier patch. Thanks dsmiley for your prior feedback on this.

Nishant Mehta added a comment - 06/Jul/17 22:46 Here's another approach we tried for limiting the MTQ expansion of span queries. FrequentTermsScoringRewrite collects terms using the ParallelArraysTermCollector in ScoringRewrite which is then added to the PriorityQueue in FrequentTermsScoringRewrite, overflows in the PriorityQueue are handled using the predicate supplied to FrequentTermsScoringRewrite. The default is DF_ORDER, but there's also a DF_THEN_TTF_ORDER provided for ties, and others could be added if needed. Then SpanQueryFrequentTermsScoringRewrite simply builds a SpanOrQuery out of the collected terms. romseygeek the test is inspired from your earlier patch. Thanks dsmiley for your prior feedback on this.

Timothy M. Rodriguez added a comment - 02/Oct/17 21:52

Apologies for the late alternative implementation. For what it's worth, we've been utilizing this patch for about a year and it's helped improve responsiveness to queries while limiting the expansions.

Timothy M. Rodriguez added a comment - 02/Oct/17 21:52 Apologies for the late alternative implementation. For what it's worth, we've been utilizing this patch for about a year and it's helped improve responsiveness to queries while limiting the expansions.

David Smiley added a comment - 05/Jul/18 22:30

Here's an updated patch from the most recent one. I made small modifications to bring it up to date with master, and I added a bunch of nocommits at spots as reminders to ask questions here in JIRA, but most aren't so much true nocommits (most are not a blocker to me).

Comments about existing stuff:

TopTermsRewrite: Why does this sort the terms before looping and calling addClause? https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L159 It's not a big deal but it's not clear what's the point. It appears that addClause, currently having no javadocs, ought to state that it is called in term order if it does matter. thetaphi thought you might know as you added this code. The test org.apache.lucene.search.TestMultiTermQueryRewrites#testRewritesWithDuplicateTerms expressly tests that the terms are in order.
SpanMultiTermQueryWrapper: shouldn't equals() consider the rewrite method?
SpanMultiTermQueryWrapper: suggest deprecate/removing setRewriteMethod(SpanRewriteMethod) and doing the same to the referenced inner class SpanRewriteMethod. getRewriteMethod can return MultiTermQuery.RewriteMethod. In short, we don't need a typed SpanQuery derivative of RewriteMethod. It's a bit limiting; it means we can't have an impl that subclasses another RewriteMethod like ScoringRewrite. Also the setter makes it mutable but I don't think we want mutable Query subclasses anymore.

New stuff:

SpanQueryFrequentTermsScoringRewrite: should perhaps be an inner class of SpanMultiTermQueryWrapper like some of the existing ones?
SpanQueryFrequentTermsScoringRewrite: does it matter if the SpanOrQuery we build has it's SpanTermQuery clauses in lexicographic order or not? This kinda gets at my Q earlier about TopTermsRewrite sorting.
FrequentTermsScoringRewrite: probably needs equals & hashcode? Reference equality will have to do for the pluggable BiPredicate.
FrequentTermsScoringRewrite.ScoreTerm we could remove this if SpanTermQuery had getTermStates() like how TermQuery now has this method – ~~LUCENE-8379~~.

David Smiley added a comment - 05/Jul/18 22:30 Here's an updated patch from the most recent one. I made small modifications to bring it up to date with master, and I added a bunch of nocommits at spots as reminders to ask questions here in JIRA, but most aren't so much true nocommits (most are not a blocker to me). Comments about existing stuff: TopTermsRewrite: Why does this sort the terms before looping and calling addClause? https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L159 It's not a big deal but it's not clear what's the point. It appears that addClause, currently having no javadocs, ought to state that it is called in term order if it does matter. thetaphi thought you might know as you added this code. The test org.apache.lucene.search.TestMultiTermQueryRewrites#testRewritesWithDuplicateTerms expressly tests that the terms are in order. SpanMultiTermQueryWrapper: shouldn't equals() consider the rewrite method? SpanMultiTermQueryWrapper: suggest deprecate/removing setRewriteMethod(SpanRewriteMethod) and doing the same to the referenced inner class SpanRewriteMethod. getRewriteMethod can return MultiTermQuery.RewriteMethod. In short, we don't need a typed SpanQuery derivative of RewriteMethod. It's a bit limiting; it means we can't have an impl that subclasses another RewriteMethod like ScoringRewrite. Also the setter makes it mutable but I don't think we want mutable Query subclasses anymore. New stuff: SpanQueryFrequentTermsScoringRewrite: should perhaps be an inner class of SpanMultiTermQueryWrapper like some of the existing ones? SpanQueryFrequentTermsScoringRewrite: does it matter if the SpanOrQuery we build has it's SpanTermQuery clauses in lexicographic order or not? This kinda gets at my Q earlier about TopTermsRewrite sorting. FrequentTermsScoringRewrite: probably needs equals & hashcode? Reference equality will have to do for the pluggable BiPredicate. FrequentTermsScoringRewrite.ScoreTerm we could remove this if SpanTermQuery had getTermStates() like how TermQuery now has this method – LUCENE-8379 .

Uwe Schindler added a comment - 07/Jul/18 21:25

Hi David,

I have to dig into the code to understand why it actually sorts. It might be a relic from earlier times, but there was for sure some reason about it. To figure that out, I will go through the history of this files. If sorting is not needed, we can remove the test, too.

Uwe

Uwe Schindler added a comment - 07/Jul/18 21:25 Hi David, I have to dig into the code to understand why it actually sorts. It might be a relic from earlier times, but there was for sure some reason about it. To figure that out, I will go through the history of this files. If sorting is not needed, we can remove the test, too. Uwe

David Smiley added a comment - 15/Nov/19 19:48

Note: the new PhraseWildcardQuery in ~~LUCENE-8983~~ was developed to solve this problem. However it is not a "Spans" based Query and so I think the current issue is still valid. If I have time I'll show something for this issue.

David Smiley added a comment - 15/Nov/19 19:48 Note: the new PhraseWildcardQuery in LUCENE-8983 was developed to solve this problem. However it is not a "Spans" based Query and so I think the current issue is still valid. If I have time I'll show something for this issue.

Mikhail Khludnev added a comment - 15/Nov/19 21:41

I think it's rather better to fix highlighting for intervals, where multiterms are bounded.

Mikhail Khludnev added a comment - 15/Nov/19 21:41 I think it's rather better to fix highlighting for intervals, where multiterms are bounded.

Tomoko Uchida added a comment - 28/Aug/22 14:33

This issue was moved to GitHub issue: #7571.

Tomoko Uchida added a comment - 28/Aug/22 14:33 This issue was moved to GitHub issue: #7571 .

People

Assignee:: David Smiley

Reporter:: Alan Woodward

Votes:: 1 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 01/Jun/15 10:48

Updated:: 28/Aug/22 14:33

Lucene - Core

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates