Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • None
    • None
    • New

    Description

      SpanMultiTermQueryWrapper currently rewrites to a SpanOrQuery with as many clauses as there are matching terms. It would be nice to be able to limit this in a slightly nicer way than using TopTerms, which for most queries just translates to a lexicographical ordering.

      Attachments

        1. LUCENE-6513.patch
          12 kB
          Alan Woodward
        2. LUCENE-6513.patch
          12 kB
          Alan Woodward
        3. LUCENE-6513.patch
          10 kB
          Alan Woodward
        4. LUCENE-6513.patch
          13 kB
          Nishant Mehta
        5. LUCENE-6513.patch
          16 kB
          David Smiley

        Issue Links

          Activity

            romseygeek Alan Woodward added a comment -

            Patch. This adds a FrequentTerms rewrite to SpanMTQWrapper, which will rewrite the query using the terms with highest term frequencies.

            romseygeek Alan Woodward added a comment - Patch. This adds a FrequentTerms rewrite to SpanMTQWrapper, which will rewrite the query using the terms with highest term frequencies.
            rcmuir Robert Muir added a comment -

            Hmm, ideally we don't need to implement this as SpanOrQuery?

            In other words, I think the rewrite method should be TopTerms with a different comparison function...

            rcmuir Robert Muir added a comment - Hmm, ideally we don't need to implement this as SpanOrQuery? In other words, I think the rewrite method should be TopTerms with a different comparison function...
            romseygeek Alan Woodward added a comment -

            The problem with using TopTerms is that it requires that terms have the same comparison value across segments (which makes it more efficient, because it can only build TermContexts for candidate terms), but that doesn't hold for termFreq.

            romseygeek Alan Woodward added a comment - The problem with using TopTerms is that it requires that terms have the same comparison value across segments (which makes it more efficient, because it can only build TermContexts for candidate terms), but that doesn't hold for termFreq.
            rcmuir Robert Muir added a comment -

            Can we solve this somehow in a better way? thetaphi any ideas?

            rcmuir Robert Muir added a comment - Can we solve this somehow in a better way? thetaphi any ideas?
            uschindler Uwe Schindler added a comment -

            Not yet, I have to discuss with Alan in person. I don't yet undertand the problem of using another rewrite. It is better what we currently have in any case - if you are happy with the fact that not all terms are used in the query.

            uschindler Uwe Schindler added a comment - Not yet, I have to discuss with Alan in person. I don't yet undertand the problem of using another rewrite. It is better what we currently have in any case - if you are happy with the fact that not all terms are used in the query.
            romseygeek Alan Woodward added a comment -

            Updated to trunk. I don't think there's any easier way to solve this (the other SpanRewriteMethods end up building SpanOrQueries as well)

            romseygeek Alan Woodward added a comment - Updated to trunk. I don't think there's any easier way to solve this (the other SpanRewriteMethods end up building SpanOrQueries as well)
            romseygeek Alan Woodward added a comment -

            Waking this one up again, here's a different attempt at limiting MTQ expansion for span queries. We can't easily use TopTermsRewrite to select the most-frequent terms, because term frequency is an index-wide stat, and TopTerms only sees info from a single segment at a time. Instead, I'm using MultiTermsEnum to iterate through the matching terms, and recording the most frequent in a priority queue.

            The default is for SpanMTQWrapper to have no limits, but maybe we should change this to use the max boolean clause limit?

            romseygeek Alan Woodward added a comment - Waking this one up again, here's a different attempt at limiting MTQ expansion for span queries. We can't easily use TopTermsRewrite to select the most-frequent terms, because term frequency is an index-wide stat, and TopTerms only sees info from a single segment at a time. Instead, I'm using MultiTermsEnum to iterate through the matching terms, and recording the most frequent in a priority queue. The default is for SpanMTQWrapper to have no limits, but maybe we should change this to use the max boolean clause limit?

            romseygeek we've written a patch to solve this problem as well we've been meaning to share with the community. It goes about the solution in a bit of a different way. We'll try to get it up here in a day or two, though I'm not sure which approach will be preferable.

            Timothy055 Timothy M. Rodriguez added a comment - romseygeek we've written a patch to solve this problem as well we've been meaning to share with the community. It goes about the solution in a bit of a different way. We'll try to get it up here in a day or two, though I'm not sure which approach will be preferable.
            nmehta79 Nishant Mehta added a comment -

            Here's another approach we tried for limiting the MTQ expansion of span queries. FrequentTermsScoringRewrite collects terms using the ParallelArraysTermCollector in ScoringRewrite which is then added to the PriorityQueue in FrequentTermsScoringRewrite, overflows in the PriorityQueue are handled using the predicate supplied to FrequentTermsScoringRewrite. The default is DF_ORDER, but there's also a DF_THEN_TTF_ORDER provided for ties, and others could be added if needed. Then SpanQueryFrequentTermsScoringRewrite simply builds a SpanOrQuery out of the collected terms. romseygeek the test is inspired from your earlier patch. Thanks dsmiley for your prior feedback on this.

            nmehta79 Nishant Mehta added a comment - Here's another approach we tried for limiting the MTQ expansion of span queries. FrequentTermsScoringRewrite collects terms using the ParallelArraysTermCollector in ScoringRewrite which is then added to the PriorityQueue in FrequentTermsScoringRewrite, overflows in the PriorityQueue are handled using the predicate supplied to FrequentTermsScoringRewrite. The default is DF_ORDER, but there's also a DF_THEN_TTF_ORDER provided for ties, and others could be added if needed. Then SpanQueryFrequentTermsScoringRewrite simply builds a SpanOrQuery out of the collected terms. romseygeek the test is inspired from your earlier patch. Thanks dsmiley for your prior feedback on this.

            Apologies for the late alternative implementation. For what it's worth, we've been utilizing this patch for about a year and it's helped improve responsiveness to queries while limiting the expansions.

            Timothy055 Timothy M. Rodriguez added a comment - Apologies for the late alternative implementation. For what it's worth, we've been utilizing this patch for about a year and it's helped improve responsiveness to queries while limiting the expansions.
            dsmiley David Smiley added a comment -

            Here's an updated patch from the most recent one. I made small modifications to bring it up to date with master, and I added a bunch of nocommits at spots as reminders to ask questions here in JIRA, but most aren't so much true nocommits (most are not a blocker to me).

            Comments about existing stuff:

            • TopTermsRewrite: Why does this sort the terms before looping and calling addClause? https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L159 It's not a big deal but it's not clear what's the point. It appears that addClause, currently having no javadocs, ought to state that it is called in term order if it does matter. thetaphi thought you might know as you added this code. The test org.apache.lucene.search.TestMultiTermQueryRewrites#testRewritesWithDuplicateTerms expressly tests that the terms are in order.
            • SpanMultiTermQueryWrapper: shouldn't equals() consider the rewrite method?
            • SpanMultiTermQueryWrapper: suggest deprecate/removing setRewriteMethod(SpanRewriteMethod) and doing the same to the referenced inner class SpanRewriteMethod. getRewriteMethod can return MultiTermQuery.RewriteMethod. In short, we don't need a typed SpanQuery derivative of RewriteMethod. It's a bit limiting; it means we can't have an impl that subclasses another RewriteMethod like ScoringRewrite. Also the setter makes it mutable but I don't think we want mutable Query subclasses anymore.

            New stuff:

            • SpanQueryFrequentTermsScoringRewrite: should perhaps be an inner class of SpanMultiTermQueryWrapper like some of the existing ones?
            • SpanQueryFrequentTermsScoringRewrite: does it matter if the SpanOrQuery we build has it's SpanTermQuery clauses in lexicographic order or not? This kinda gets at my Q earlier about TopTermsRewrite sorting.
            • FrequentTermsScoringRewrite: probably needs equals & hashcode? Reference equality will have to do for the pluggable BiPredicate.
            • FrequentTermsScoringRewrite.ScoreTerm we could remove this if SpanTermQuery had getTermStates() like how TermQuery now has this method – LUCENE-8379.
            dsmiley David Smiley added a comment - Here's an updated patch from the most recent one. I made small modifications to bring it up to date with master, and I added a bunch of nocommits at spots as reminders to ask questions here in JIRA, but most aren't so much true nocommits (most are not a blocker to me). Comments about existing stuff: TopTermsRewrite: Why does this sort the terms before looping and calling addClause? https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L159 It's not a big deal but it's not clear what's the point. It appears that addClause, currently having no javadocs, ought to state that it is called in term order if it does matter. thetaphi thought you might know as you added this code. The test org.apache.lucene.search.TestMultiTermQueryRewrites#testRewritesWithDuplicateTerms expressly tests that the terms are in order. SpanMultiTermQueryWrapper: shouldn't equals() consider the rewrite method? SpanMultiTermQueryWrapper: suggest deprecate/removing setRewriteMethod(SpanRewriteMethod) and doing the same to the referenced inner class SpanRewriteMethod. getRewriteMethod can return MultiTermQuery.RewriteMethod. In short, we don't need a typed SpanQuery derivative of RewriteMethod. It's a bit limiting; it means we can't have an impl that subclasses another RewriteMethod like ScoringRewrite. Also the setter makes it mutable but I don't think we want mutable Query subclasses anymore. New stuff: SpanQueryFrequentTermsScoringRewrite: should perhaps be an inner class of SpanMultiTermQueryWrapper like some of the existing ones? SpanQueryFrequentTermsScoringRewrite: does it matter if the SpanOrQuery we build has it's SpanTermQuery clauses in lexicographic order or not? This kinda gets at my Q earlier about TopTermsRewrite sorting. FrequentTermsScoringRewrite: probably needs equals & hashcode? Reference equality will have to do for the pluggable BiPredicate. FrequentTermsScoringRewrite.ScoreTerm we could remove this if SpanTermQuery had getTermStates() like how TermQuery now has this method – LUCENE-8379 .
            uschindler Uwe Schindler added a comment -

            Hi David,

            I have to dig into the code to understand why it actually sorts. It might be a relic from earlier times, but there was for sure some reason about it. To figure that out, I will go through the history of this files. If sorting is not needed, we can remove the test, too.

            Uwe

            uschindler Uwe Schindler added a comment - Hi David, I have to dig into the code to understand why it actually sorts. It might be a relic from earlier times, but there was for sure some reason about it. To figure that out, I will go through the history of this files. If sorting is not needed, we can remove the test, too. Uwe
            dsmiley David Smiley added a comment -

            Note: the new PhraseWildcardQuery in LUCENE-8983 was developed to solve this problem. However it is not a "Spans" based Query and so I think the current issue is still valid. If I have time I'll show something for this issue.

            dsmiley David Smiley added a comment - Note: the new PhraseWildcardQuery in LUCENE-8983 was developed to solve this problem. However it is not a "Spans" based Query and so I think the current issue is still valid. If I have time I'll show something for this issue.

            I think it's rather better to fix highlighting for intervals, where multiterms are bounded.

            mkhl Mikhail Khludnev added a comment - I think it's rather better to fix highlighting for intervals, where multiterms are bounded.
            tomoko Tomoko Uchida added a comment -

            This issue was moved to GitHub issue: #7571.

            tomoko Tomoko Uchida added a comment - This issue was moved to GitHub issue: #7571 .

            People

              dsmiley David Smiley
              romseygeek Alan Woodward
              Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated: