Lucene - Core
  1. Lucene - Core
  2. LUCENE-4628

Add common terms query to gracefully handle very high frequent terms dynamically

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I had this problem quite a couple of times the last couple of month that searches very often contained super high frequent terms and disjunction queries became way too slow. The main problem was that stopword filtering wasn't really an option since in the domain those high-freq terms where not really stopwords though. So for instance searching for a song title "this is it" or for a band "A" didn't really fly with stopwords. I thought about that for a while and came up with a query based solution that decides based on a threshold if something is considered a stopword or not and if so it moves the term in two boolean queries one for high-frequent and one for low-frequent such that those high frequent terms are only matched if the low-frequent sub-query produces a match. Yet if all terms are high frequent it makes the entire thing a Conjunction which gave me reasonable results as well as performance.

      1. LUCENE-4628.patch
        27 kB
        Simon Willnauer
      2. LUCENE-4628.patch
        24 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        here is a patch as a start. this query does a lot of things right now maybe we can break something out of it.

        Show
        Simon Willnauer added a comment - here is a patch as a start. this query does a lot of things right now maybe we can break something out of it.
        Hide
        Mark Miller added a comment -

        +1, nice.

        Show
        Mark Miller added a comment - +1, nice.
        Hide
        Robert Muir added a comment -

        I like the idea too.

        Maybe add a QueryUtils.check to the test since its a new query (just for equals/hashcode checking)
        Should we throw IAE in the ctor if the Occur is MUST_NOT?
        Maybe make lowFreqBoost and highFreqBoost protected in case someone wants to subclass and override just buildQuery to customize what it makes?

        Show
        Robert Muir added a comment - I like the idea too. Maybe add a QueryUtils.check to the test since its a new query (just for equals/hashcode checking) Should we throw IAE in the ctor if the Occur is MUST_NOT? Maybe make lowFreqBoost and highFreqBoost protected in case someone wants to subclass and override just buildQuery to customize what it makes?
        Hide
        Simon Willnauer added a comment -

        next iteration...

        • added more tests including QueryUtils.check (thanks rob good point)
        • throw IAE if MUST_NOT is passed and added javadocs for it
        • made low/highFreqBoost protected and removed setters & getters for it.
        • added changes entry.

        This lives now in the query module and I think that fine for that query we can still move it into core if needed.

        Show
        Simon Willnauer added a comment - next iteration... added more tests including QueryUtils.check (thanks rob good point) throw IAE if MUST_NOT is passed and added javadocs for it made low/highFreqBoost protected and removed setters & getters for it. added changes entry. This lives now in the query module and I think that fine for that query we can still move it into core if needed.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Simon Willnauer
        http://svn.apache.org/viewvc?view=revision&revision=1421743

        LUCENE-4628: Added CommonTermsQuery

        Show
        Commit Tag Bot added a comment - [trunk commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1421743 LUCENE-4628 : Added CommonTermsQuery
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Simon Willnauer
        http://svn.apache.org/viewvc?view=revision&revision=1421763

        LUCENE-4628: Added CommonTermsQuery

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1421763 LUCENE-4628 : Added CommonTermsQuery
        Hide
        David Smiley added a comment -

        Why not just use CommonGrams?

        Show
        David Smiley added a comment - Why not just use CommonGrams?
        Hide
        Simon Willnauer added a comment -

        hey david, I suggest you look a bit closer at the code. This serves a similar usecase but it's way more powerful. here are some differences:

        • CommonGrams need to be build at index and query time - CommonTermsQuery can be used on any index
        • CommonGrams need a prebuild dictionary of high freq terms - CommonTermsQuery can efficiently detect high freq terms at query time and "rewrite" the query
        • CommonGrams can not search for high freq terms in conjunction - CommonTermsQuery can do stuff like "this is it" or "to be or not to be" in a conjunction query and still be reasonable in terms of performance.

        makes sense?

        Show
        Simon Willnauer added a comment - hey david, I suggest you look a bit closer at the code. This serves a similar usecase but it's way more powerful. here are some differences: CommonGrams need to be build at index and query time - CommonTermsQuery can be used on any index CommonGrams need a prebuild dictionary of high freq terms - CommonTermsQuery can efficiently detect high freq terms at query time and "rewrite" the query CommonGrams can not search for high freq terms in conjunction - CommonTermsQuery can do stuff like "this is it" or "to be or not to be" in a conjunction query and still be reasonable in terms of performance. makes sense?
        Hide
        David Smiley added a comment -

        I did look at the code, and I understand that this makes its decisions dynamically at search time without needing any index changes. But from what I can tell, there is no positional information and so a search for "this is it" would erroneously match "this is NOT it" in indexed text. Ouch. CommonGrams doesn't have this problem. And CommonTermsQuery can sometimes change the intended AND/OR semantics, right? At least CTQ can be used when a phrase query is not used, and the AND/OR issue is maybe not a big deal. I guess these two techniques are complementary and can be used at the same time.

        Show
        David Smiley added a comment - I did look at the code, and I understand that this makes its decisions dynamically at search time without needing any index changes. But from what I can tell, there is no positional information and so a search for "this is it" would erroneously match "this is NOT it" in indexed text. Ouch. CommonGrams doesn't have this problem. And CommonTermsQuery can sometimes change the intended AND/OR semantics, right? At least CTQ can be used when a phrase query is not used, and the AND/OR issue is maybe not a big deal. I guess these two techniques are complementary and can be used at the same time.
        Hide
        Robert Muir added a comment -

        I think there is some confusion here:

        • CommonGrams is for speeding up PHRASE queries.
        • This feature is for speeding up BOOLEAN queries.
        Show
        Robert Muir added a comment - I think there is some confusion here: CommonGrams is for speeding up PHRASE queries. This feature is for speeding up BOOLEAN queries.
        Hide
        Matthew Willson added a comment -

        This is nice.

        Would it make sense, perhaps, to base the cut-off on the cumulative document frequency – so sort terms by DF, then add terms into the MUST subquery one at a time until a limit is exceeded on the total DF of all terms added. Then the remaining terms get added into a SHOULD subquery.

        This seems like it would set an upper bound on the total number of documents scored, or the total number of postings list entries which need to be inspected to select documents for scoring. (Good chance I'm missing something here mind...)

        Whereas a cut-off based on per-term doc frequency, you could have arbitrarily many terms introduced into the MUST subquery, provided they all slip under the per-term DF threshold. And hence arbitrarily many documents scored.

        Show
        Matthew Willson added a comment - This is nice. Would it make sense, perhaps, to base the cut-off on the cumulative document frequency – so sort terms by DF, then add terms into the MUST subquery one at a time until a limit is exceeded on the total DF of all terms added. Then the remaining terms get added into a SHOULD subquery. This seems like it would set an upper bound on the total number of documents scored, or the total number of postings list entries which need to be inspected to select documents for scoring. (Good chance I'm missing something here mind...) Whereas a cut-off based on per-term doc frequency, you could have arbitrarily many terms introduced into the MUST subquery, provided they all slip under the per-term DF threshold. And hence arbitrarily many documents scored.

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development