Lucene - Core
  1. Lucene - Core
  2. LUCENE-6360

TermsQuery should rewrite to a ConstantScoreQuery over a BooleanQuery when there are few terms

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      TermsQuery helps when there are lot of terms from which you would like to compute the union, but it is a bit harmful when you have few terms since it cannot really skip: it always consumes all documents matching the underlying terms.

      It would certainly help to rewrite this query to a ConstantScoreQuery over a BooleanQuery when there are few terms in order to have actual skip support.

      As usual the hard part is probably to figure out the threshold.

      1. LUCENE-6360.patch
        21 kB
        Adrien Grand
      2. LUCENE-6360.patch
        19 kB
        Adrien Grand
      3. LUCENE-6360.patch
        14 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          Paul Elschot added a comment -

          I wonder whether a compressing DocIdSet could also help here.
          EliasFanoDocIdSet uses an internal threshold for the number of matching docs, and above that threshold it changes itself to a bitset.
          The tradeoff for this is not directly related to skipping because building the set requires all matching docs.
          But a small compressing docidset skips/advances faster than a bitset.

          Some of this can be estimated in advance by the doc frequencies of the terms involved.

          To figure out the threshold(s), real life test cases would be helpful.
          Do you have some in mind already?

          Show
          Paul Elschot added a comment - I wonder whether a compressing DocIdSet could also help here. EliasFanoDocIdSet uses an internal threshold for the number of matching docs, and above that threshold it changes itself to a bitset. The tradeoff for this is not directly related to skipping because building the set requires all matching docs. But a small compressing docidset skips/advances faster than a bitset. Some of this can be estimated in advance by the doc frequencies of the terms involved. To figure out the threshold(s), real life test cases would be helpful. Do you have some in mind already?
          Hide
          Adrien Grand added a comment -

          Sorry Paul, I had missed your comment.

          I wonder whether a compressing DocIdSet could also help here.

          This is something we are already doing: we use BitDocIdSetBuilder which internally starts with a sparse bit set and upgrades to a dense FixedBitSet if the cardinality becomes high. I agree this is already a win. But having actual skipping support would be even better? Ie. if you intersect with a sparse filter, you would not even need to iterate over documents that don't match the filter.

          To figure out the threshold(s), real life test cases would be helpful. Do you have some in mind already?

          My current idea is to have the same threshold here and when running MultiTermQueries, LUCENE-6458

          Show
          Adrien Grand added a comment - Sorry Paul, I had missed your comment. I wonder whether a compressing DocIdSet could also help here. This is something we are already doing: we use BitDocIdSetBuilder which internally starts with a sparse bit set and upgrades to a dense FixedBitSet if the cardinality becomes high. I agree this is already a win. But having actual skipping support would be even better? Ie. if you intersect with a sparse filter, you would not even need to iterate over documents that don't match the filter. To figure out the threshold(s), real life test cases would be helpful. Do you have some in mind already? My current idea is to have the same threshold here and when running MultiTermQueries, LUCENE-6458
          Hide
          Adrien Grand added a comment -

          Here is a patch.
          1. TermsQuery rewrites to a ConstantScoreQuery(BooleanQuery) when there are 16 terms or less.
          2. If there are more than 16 terms, then still per segment TermsQuery will first try to collect terms and if there are less than 16 terms that exist on this segment, it will also execute like a disjunction.

          I took 16 as a threshold to use the same as MultiTermQueryConstantScoreWrapper.

          1. is mostly useful for caching so that if you have a short TermsQuery it will share the same cache key as a disjunction around the same terms.

          Show
          Adrien Grand added a comment - Here is a patch. 1. TermsQuery rewrites to a ConstantScoreQuery(BooleanQuery) when there are 16 terms or less. 2. If there are more than 16 terms, then still per segment TermsQuery will first try to collect terms and if there are less than 16 terms that exist on this segment, it will also execute like a disjunction. I took 16 as a threshold to use the same as MultiTermQueryConstantScoreWrapper. 1. is mostly useful for caching so that if you have a short TermsQuery it will share the same cache key as a disjunction around the same terms.
          Hide
          David Smiley added a comment -

          I reviewed the patch. The code for #2 is pretty interesting. I learn a lot by reading your patches.

          I noticed what looks like a bug in TermsQuery.createWeight.scorer:

                    // comparing references is fine here
                    if (field != lastField) {
                      terms = fields.terms(field);
                      if (terms == null) {
                        termsEnum = null;
                      } else {
                        termsEnum = terms.iterator();
                      }
                    }
          

          Your code didn't introduce that, and yes I can read the comment lastField is never updated, so it will fetch terms.iterator() every time.

          Secondly... I think the needsScores param should arguably not pass through to the ConstantScoreQuery wrapped BooleanQuery, since this should be constant scoring; no? Or maybe it's moot since it's CSQ after all.

          Did you mean for the changes in CoalescedUpdates and FrozenBufferedUpdates to be in this patch?

          Show
          David Smiley added a comment - I reviewed the patch. The code for #2 is pretty interesting. I learn a lot by reading your patches. I noticed what looks like a bug in TermsQuery.createWeight.scorer: // comparing references is fine here if (field != lastField) { terms = fields.terms(field); if (terms == null ) { termsEnum = null ; } else { termsEnum = terms.iterator(); } } Your code didn't introduce that, and yes I can read the comment lastField is never updated, so it will fetch terms.iterator() every time. Secondly... I think the needsScores param should arguably not pass through to the ConstantScoreQuery wrapped BooleanQuery, since this should be constant scoring; no? Or maybe it's moot since it's CSQ after all. Did you mean for the changes in CoalescedUpdates and FrozenBufferedUpdates to be in this patch?
          Hide
          Adrien Grand added a comment -

          I noticed what looks like a bug in TermsQuery.createWeight.scorer:

          Good catch, it's a bad bug indeed! Here is an updated patch with a test that we only pull one iterator per unique field.

          Secondly... I think the needsScores param should arguably not pass through to the ConstantScoreQuery wrapped BooleanQuery, since this should be constant scoring; no? Or maybe it's moot since it's CSQ after all.

          Actually I think it does need to pass through to the CSQ. The current contract is that if you pass needsScores=false then scores are going to be undefined, so if the user passed needsScores=true we need to make sure that we build a query that will return the same scores.

          By the way if we did not, it would probably break the scores since ConstantScoreQuery.createWeight returns the inner weight directly when scores are not needed.

          Did you mean for the changes in CoalescedUpdates and FrozenBufferedUpdates to be in this patch?

          Yes: I needed to have the term count to rewrite so I changed PrefixCodedTerms to store the number of wrapped terms and then noticed that these other classes were maintaining this number of terms on the side, so I refactored them to use PrefixCodedTerms.size() instead?

          Show
          Adrien Grand added a comment - I noticed what looks like a bug in TermsQuery.createWeight.scorer: Good catch, it's a bad bug indeed! Here is an updated patch with a test that we only pull one iterator per unique field. Secondly... I think the needsScores param should arguably not pass through to the ConstantScoreQuery wrapped BooleanQuery, since this should be constant scoring; no? Or maybe it's moot since it's CSQ after all. Actually I think it does need to pass through to the CSQ. The current contract is that if you pass needsScores=false then scores are going to be undefined, so if the user passed needsScores=true we need to make sure that we build a query that will return the same scores. By the way if we did not, it would probably break the scores since ConstantScoreQuery.createWeight returns the inner weight directly when scores are not needed. Did you mean for the changes in CoalescedUpdates and FrozenBufferedUpdates to be in this patch? Yes: I needed to have the term count to rewrite so I changed PrefixCodedTerms to store the number of wrapped terms and then noticed that these other classes were maintaining this number of terms on the side, so I refactored them to use PrefixCodedTerms.size() instead?
          Hide
          David Smiley added a comment -

          Nice; looks great now. Thanks for the explanations.

          I doubt the optimal threshold for the rewrite() is the same as the optimal threshold to return a per-segment scorer... but using the same is fine.

          Show
          David Smiley added a comment - Nice; looks great now. Thanks for the explanations. I doubt the optimal threshold for the rewrite() is the same as the optimal threshold to return a per-segment scorer... but using the same is fine.
          Hide
          Adrien Grand added a comment -

          I doubt the optimal threshold for the rewrite() is the same as the optimal threshold to return a per-segment scorer... but using the same is fine.

          Agreed. The way I see it is that we have to start somewhere and starting small has the benefit of putting us on the safe side.

          Show
          Adrien Grand added a comment - I doubt the optimal threshold for the rewrite() is the same as the optimal threshold to return a per-segment scorer... but using the same is fine. Agreed. The way I see it is that we have to start somewhere and starting small has the benefit of putting us on the safe side.
          Hide
          David Smiley added a comment -

          Like in MultiTermQueryConstantScoreWrapper ( https://issues.apache.org/jira/browse/LUCENE-6458?focusedCommentId=14554181#comment-14554181 ) which will soon have a specialized case for bulkScorer, should there similarly be a bulkScorer impl here?

          Show
          David Smiley added a comment - Like in MultiTermQueryConstantScoreWrapper ( https://issues.apache.org/jira/browse/LUCENE-6458?focusedCommentId=14554181#comment-14554181 ) which will soon have a specialized case for bulkScorer, should there similarly be a bulkScorer impl here?
          Hide
          Adrien Grand added a comment -

          Yes indeed. I'm working on it right now.

          Show
          Adrien Grand added a comment - Yes indeed. I'm working on it right now.
          Hide
          Adrien Grand added a comment -

          Here is an updated patch. It works in a very similar way to MultiTermQueryConstantScoreWrapper for BulkScorer delegation.

          Show
          Adrien Grand added a comment - Here is an updated patch. It works in a very similar way to MultiTermQueryConstantScoreWrapper for BulkScorer delegation.
          Hide
          ASF subversion and git services added a comment -

          Commit 1680946 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1680946 ]

          LUCENE-6360: Make TermsQuery rewrite as a disjunction when there are few terms.

          Show
          ASF subversion and git services added a comment - Commit 1680946 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1680946 ] LUCENE-6360 : Make TermsQuery rewrite as a disjunction when there are few terms.
          Hide
          ASF subversion and git services added a comment -

          Commit 1680947 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1680947 ]

          LUCENE-6360: Add CHANGES entry.

          Show
          ASF subversion and git services added a comment - Commit 1680947 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1680947 ] LUCENE-6360 : Add CHANGES entry.
          Hide
          ASF subversion and git services added a comment -

          Commit 1680948 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1680948 ]

          LUCENE-6360: Make TermsQuery rewrite as a disjunction when there are few terms.

          Show
          ASF subversion and git services added a comment - Commit 1680948 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1680948 ] LUCENE-6360 : Make TermsQuery rewrite as a disjunction when there are few terms.
          Hide
          Adrien Grand added a comment -

          Thanks again David!

          Show
          Adrien Grand added a comment - Thanks again David!
          Hide
          Anshum Gupta added a comment -

          Bulk close for 5.2.0.

          Show
          Anshum Gupta added a comment - Bulk close for 5.2.0.

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development