Lucene - Core
LUCENE-1644

Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When MultiTermQuery is used (via one of its subclasses, eg
      WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use
      "constant score mode", which pre-builds a filter and then wraps that
      filter as a ConstantScoreQuery.

      If you don't set that, it instead builds a [potentially massive]
      BooleanQuery with one SHOULD clause per term.

      There are some limitations of this approach:

      • The scores returned by the BooleanQuery are often quite
        meaningless to the app, so, one should be able to use a
        BooleanQuery yet get constant scores back. (Though I vaguely
        remember at least one example someone raised where the scores were
        useful...).
      • The resulting BooleanQuery can easily have too many clauses,
        throwing an extremely confusing exception to newish users.
      • It'd be better to have the freedom to pick "build filter up front"
        vs "build massive BooleanQuery", when constant scoring is enabled,
        because they have different performance tradeoffs.
      • In constant score mode, an OpenBitSet is always used, yet for
        sparse bit sets this does not give good performance.

      I think we could address these issues by giving BooleanQuery a
      constant score mode, then empower MultiTermQuery (when in constant
      score mode) to pick & choose whether to use BooleanQuery vs up-front
      filter, and finally empower MultiTermQuery to pick the best (sparse vs
      dense) bit set impl.
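
      For illustration, here is how the two existing behaviors are selected
      today, via the not-yet-released trunk API that the discussion below
      ends up replacing; a minimal sketch, not committed code:

          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.WildcardQuery;

          WildcardQuery q = new WildcardQuery(new Term("body", "lucen*"));

          // Default: rewrite() expands into a BooleanQuery with one SHOULD
          // clause per matching term; scores are term-dependent and the
          // expansion can throw BooleanQuery.TooManyClauses.
          q.setConstantScoreRewrite(false);

          // Constant score mode: rewrite() pre-builds a Filter (an OpenBitSet
          // over all matching docs) and wraps it in a ConstantScoreQuery.
          q.setConstantScoreRewrite(true);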

      1. LUCENE-1644.patch
        64 kB
        Michael McCandless
      2. LUCENE-1644.patch
        63 kB
        Michael McCandless
      3. LUCENE-1644.patch
        62 kB
        Michael McCandless
      4. LUCENE-1644.patch
        60 kB
        Michael McCandless
      5. LUCENE-1644.patch
        57 kB
        Michael McCandless
      6. LUCENE-1644.patch
        46 kB
        Michael McCandless

        Activity

        Mark Miller added a comment -

        I'm inclined to think we push to 3.0?

        Michael McCandless added a comment -

        I agree... moving out.

        Michael McCandless added a comment -

        I'd really like to get this one in for 2.9. The API is still
        malleable because the constant-score rewrite mode of MultiTermQuery
        hasn't been released yet.

        Attached patch that adds a MultiTermQuery.RewriteMethod parameter,
        with current values FILTER, SCORING_BOOLEAN_QUERY and
        CONSTANT_BOOLEAN_QUERY. This replaces the
        setConstantScoreRewrite(boolean) method.

        I also added javadocs noting that the two remaining multi-term queries
        that have not yet switched over to the FILTER rewrite method
        (WildcardQuery and PrefixQuery) will switch over in 3.0. (LUCENE-1557
        is already open to make that switch.)
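
        For illustration, selecting among the new values looks like this (a
        sketch using the constant names from this patch revision; later
        revisions rename them):

            import org.apache.lucene.index.Term;
            import org.apache.lucene.search.MultiTermQuery;
            import org.apache.lucene.search.PrefixQuery;

            PrefixQuery q = new PrefixQuery(new Term("title", "luc"));

            // Pre-build a filter, wrap as ConstantScoreQuery (the old
            // "constant score mode"):
            q.setRewriteMethod(MultiTermQuery.FILTER);

            // Expand into a scoring BooleanQuery (the old default):
            q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY);

            // New: expand into a BooleanQuery but return constant scores:
            q.setRewriteMethod(MultiTermQuery.CONSTANT_BOOLEAN_QUERY);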

        Robert Muir added a comment -

        Mike, one question: couldn't you consider FILTER versus CONSTANT_BOOLEAN_QUERY an implementation detail? Could Lucene pick / switch over to the best one?

        I guess in my opinion there was nothing wrong with setConstantScoreRewrite() from an API perspective, but behind the scenes maybe Lucene could be a bit smarter about how it actually accomplishes that?

        Mark Miller added a comment -

        That's originally what I thought this issue was about - the correct method would be chosen under the covers using a heuristic. A progress-not-perfection deal?

        Michael McCandless added a comment -

        couldn't you consider FILTER versus CONSTANT_BOOLEAN_QUERY an implementation detail? could lucene pick / switch over to the best one?

        Yeah I struggled with this. I completely agree it's an impl detail – the user should just have to say "I want constant scoring" and Lucene finds the most performant way to achieve it.

        But then I realized it's not obvious when one impl should be chosen over another. Often FILTER is faster than CONSTANT_BOOLEAN_QUERY, but at some point, once the index becomes large enough, the underlying O(maxDoc) cost (w/ a small constant in front) of FILTER will dominate; or, if the number of terms/docs that match is small, then CONSTANT_BOOLEAN_QUERY will win, etc. If the number of terms exceeds BooleanQuery's maxClauseCount, you must use FILTER.

        And my intuitions weren't right (I had thought NumericRangeQuery, since in general it doesn't produce too many terms, would perform well with BOOLEAN_QUERY, but from Uwe's numbers that's not the case; though we should re-test now that, with this patch, no CPU is spent on scoring).

        So, I was uncomfortable trying to make Lucene too smart under the hood, at least for this first go at it.

        Maybe we could add an AUTO option that would try to decide what's best? This way, if we mess up its smarts, the user can still fall back and force one method over another.

        (Though, since the maxClauseCount limit is so clearly a dead end, maybe even in CONSTANT_BOOLEAN_QUERY mode we should forcefully fall back to FILTER on hitting too many terms.)
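
        The dead end in question, sketched (BooleanQuery.TooManyClauses and
        its default limit of 1024 clauses are real; the surrounding code is
        illustrative):

            // Assumes: reader is an IndexReader, wildcard a MultiTermQuery.
            try {
              Query rewritten = wildcard.rewrite(reader); // one SHOULD clause per term
            } catch (BooleanQuery.TooManyClauses e) {
              // Thrown once more than BooleanQuery.getMaxClauseCount()
              // (default 1024) terms match -- the confusing exception newish
              // users hit. The proposed fallback would switch to the FILTER
              // rewrite here instead of throwing.
            }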

        Robert Muir added a comment -

        Mike, I see your point, I believe two things are involved, but I could be wrong on this!

        In the tests I have run, for "uncached" queries that match only a few docs, the query (BooleanQuery) approach is faster.
        FILTER is faster if things are cached or if the query matches many documents, even for a huge index.

        I don't think this really disagrees with Uwe's numbers, I just think the OS cache etc. plays a big part, at least that's what I thought I was seeing on my end...

        Michael McCandless added a comment -

        Your results seem to agree w/ Uwe's (once things are "hot" then FILTER is even faster, even on a rather large index). Uwe's results were on a 5M doc index. How large is your index?

        OK so how about I add a CONSTANT_AUTO mode that tries to pick? I can add [expert] setters that you can use to set the cutoffs. Then, in the javadocs, strongly encourage CONSTANT_AUTO, and state that in 3.0 it'll be the default?

        Robert Muir added a comment -

        Mike: > 100M docs. It does not fit in the OS cache.

        I really should rerun the tests since it's been quite some time.

        What do you think about CONSTANT_AUTO doing your maxClauseCount trick?

        Michael McCandless added a comment -

        Mike: > 100M docs. it does not fit in os cache

        Heh OK.

        What do you think about CONSTANT_AUTO doing your maxClauseCount trick?

        I'm thinking there'd be a cutoff on the number of terms. EG maybe it defaults to 25. So, if more than 25 terms are encountered, we'll use FILTER; else we use BOOLEAN_QUERY. And, then, at all times I'd cut over to FILTER if that threshold exceeds the maxClauseCount.

        Michael McCandless added a comment -

        Maybe we simply go back to the boolean (constant scoring or not), and then add this term count cutoff to control when a filter is used vs a BooleanQuery?

        Robert Muir added a comment - edited

        Mike, I am afraid that might hurt some people's performance.
        I'm a bit concerned my index/queries are maybe abnormal and I don't want to break the general case.

        I'm not too familiar with trie [what it would do with a really general range query], but a simpler example would be, with no stopwords, a wildcard query of "th?" (matching "the"):
        maybe it only matches one term, but that term is very common (a dense bit set) and probably "hot".

        In this case the filter would be better, even though it's 1 term.

        Michael McCandless added a comment -

        In this case the filter would be better, even though its 1 term.

        You're right. So... maybe we could also take the net doc count into account? Loading the TermInfo shouldn't be costly, since it's loaded anyway when the query/filter runs (and it's cached). We'd then sum it up as we're checking, and if it crosses the threshold (default maybe 10% of maxDoc()?) we'd cut over to FILTER?

        Though, if we go this route, we should probably do it under a new CONSTANT_AUTO method (instead of falling back to the clean single boolean constantScoreRewrite), i.e., it's getting a little too smart to always apply with no way to fall back and force it to just rewrite one way or another.

        Or we can punt on any smarts altogether for now (this being the "eve" of 2.9) and simply start with the three explicit rewrite methods.
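
        A sketch of the cutover logic being discussed; the two cutoff knobs
        are hypothetical at this point in the thread, while getEnum/term/
        docFreq are the existing term enumeration APIs:

            int termCountCutoff = 25;       // cut over above this many terms
            double docCountPercent = 10.0;  // ...or above this % of maxDoc()
            long docCountCutoff = (long) ((docCountPercent / 100.0) * reader.maxDoc());

            int termCount = 0;
            long docVisitCount = 0;         // sum of docFreq over enumerated terms
            boolean useFilter = false;

            FilteredTermEnum enumerator = query.getEnum(reader);
            try {
              do {
                Term t = enumerator.term();
                if (t == null) break;
                termCount++;
                docVisitCount += enumerator.docFreq(); // TermInfo is cached, so cheap
                if (termCount > termCountCutoff || docVisitCount > docCountCutoff) {
                  useFilter = true;         // too many terms or too many docs
                  break;
                }
              } while (enumerator.next());
            } finally {
              enumerator.close();
            }
            // useFilter == true  -> rewrite to ConstantScoreQuery over a filter
            // useFilter == false -> rewrite to a constant-score BooleanQuery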

        Robert Muir added a comment -

        Mike, that heuristic sounds really promising to me.

        At the same time, I'm starting to think your .setMultiTermRewriteMethod really is the right way to go (regardless of whether it immediately has AUTO)...

        The thing I never really liked about .setConstantScoreRewrite is that it's kind of misleading, since it really does more than just toggle scoring...

        Mark Miller added a comment -

        Or we can punt on any smarts altogether for now (this being the "eve" of 2.9) and simply start with the three explicit rewrite methods.

        +1 - unless you just want to pump it out, we want to make the current change anyway I think, so let's leave it at that and add an auto mode later.

        We should get 2.9 out sooner rather than later I think.

        Robert Muir added a comment -

        Mike, sorry to flip back and forth on this since last night.

        The patch you have here does make things less confusing, and I think that, after talking things through, the "big boolean option" with heuristics might actually make things more confusing, even if it seems convenient at first.

        In the future the AUTO mode would be a good feature I think - best of both worlds.

        Michael McCandless added a comment -

        Attached rough patch – javadocs are missing/not
        updated, need to add new tests, need to fix QueryParser.jj, etc., but
        all tests pass.

        Here's what I did:

        • Changed the MTQ.RewriteMethod class from a simple Parameter to its
          own abstract base class w/ a single method, rewrite, which
          MultiTermQuery.rewrite delegates to.
        • Switched over CONSTANT_SCORE_FILTER_REWRITE,
          SCORING_BOOLEAN_QUERY_REWRITE and
          CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE. These classes are private
          (they have no configuration), and I created final static singleton
          instances for them.
        • Created ConstantScoreAutoRewrite (and the default
          CONSTANT_SCORE_AUTO_REWRITE instance) that you can configure based
          on term count & doc count, as to when it cuts over to
          CONSTANT_SCORE_FILTER_REWRITE.

        This approach also has the benefit of allowing the "rewrite
        strategy" to be customized entirely, if none of the 4 built-in
        choices work for you.
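
        For example, a user-supplied strategy under this design could look
        like the following sketch (the rewrite(IndexReader, MultiTermQuery)
        signature is assumed from the description above):

            import java.io.IOException;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.search.MultiTermQuery;
            import org.apache.lucene.search.Query;

            // A custom rewrite strategy: skip all heuristics and always
            // build the up-front filter, by delegating to the singleton.
            public class AlwaysFilterRewrite extends MultiTermQuery.RewriteMethod {
              public Query rewrite(IndexReader reader, MultiTermQuery query) throws IOException {
                return MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE.rewrite(reader, query);
              }
            }

            // Usage: query.setRewriteMethod(new AlwaysFilterRewrite());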

        Uwe Schindler added a comment -

        Sorry that I came back to this issue too late, I am on holiday at the moment.

        In my opinion, the Parameter instead of boolean is a good idea. The latest patch is also a good idea, I only have some small problems with it:

        • Why did you make so many internal things public? The additional ctor of MultiTermQueryWrapperFilter should be package-private or protected (the class is not abstract, but should be used like abstract, so it must have only protected ctors). Only public subclasses like TermRangeFilter should have public ctors.
        • getFilter()/getEnum should stay protected.
        • I do not like the weird caching of Terms. A cleaner API would be a new class CachingFilteredTermEnum that can turn on caching for e.g. the first 20 terms and then reset. In this case the API would stay clean and the filter code does not need to be changed at all (it just harvests the TermEnum, whether it is cached or not). I would propose something like: new CachingFilteredTermEnum(originalEnum), use it normally, then termEnum.reset() to consume again, and termEnum.purgeCache() if caching is no longer needed and should be switched off (after the first 25 terms or so). The problem with MultiTermQueryWrapperFilter is that the filter is normally stateless (no reader or term enum), so the method getDocIdSet() would have to get the term enum or wrapper in addition to the IndexReader. This is not very good (it took me some time to understand what you are doing).
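
        A usage sketch of the proposed CachingFilteredTermEnum (hypothetical
        throughout; this class was never committed):

            // Wrap the real enum; the wrapper records terms as they stream by.
            CachingFilteredTermEnum termEnum = new CachingFilteredTermEnum(query.getEnum(reader));
            int count = 0;
            do {
              Term t = termEnum.term();
              if (t == null) break;
              if (++count > 25) {
                termEnum.purgeCache(); // past the cutoff: stop caching, just stream
                break;
              }
            } while (termEnum.next());

            if (count <= 25) {
              termEnum.reset();        // replay the cached terms, e.g. to build a BooleanQuery
            }
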
        Uwe Schindler added a comment -

        The biggest problem is that this caching goes completely wrong with multi-segment indexes:
        The rewriting is done on the top-level reader. In this case the boolean query would be built and the terms cached. If there are too many terms, it creates a filter instance with the cached terms.
        The rewritten query is then executed against all sub-readers using the cached terms and a fixed term enum. Normally this would create a DocIdSet for the current index reader, but the rewrite did it for the top-level index reader -> the wrong doc ids are returned, and so on. So you cannot reuse the terms collected during the rewrite operation in the getDocIdSet calls.

        So please turn off this caching entirely! As noted before, the important thing is that the filter returned by rewrite is stateless and should not know anything about index readers. The index reader is passed in getDocIdSet and is different for non-optimized indexes.

        You have seen no tests fail because all RangeQuery tests use optimized indexes.

        Michael McCandless added a comment -

        The biggest problem is, that this caching gets completely wired with multi-segment indexes

        Right, I caught this as well (there is one test that fails when I forcefully swap in constant-boolean-query as the constant score method), and I'm now turning off the caching.

        I've fixed it locally – will post a new rev soon.

        Michael McCandless added a comment -

        Attached patch: fixed some bugs in the last rev, updated test cases,
        javadocs, CHANGES. I also optimized MultiTermQueryWrapperFilter to
        use the bulk-read API from termDocs.

        I confirmed all tests pass if I temporarily switch
        CONSTANT_SCORE_FILTER_REWRITE to CONSTANT_SCORE_AUTO_REWRITE_DEFAULT.

        I changed QueryParser to use CONSTANT_SCORE_AUTO for rewrite (it was
        previously CONSTANT_FILTER).

        I still need to run some perf tests to get a rough sense of decent
        defaults for CONSTANT_SCORE_AUTO cutover thresholds.

        getFilter()/getEnum should stay protected.

        OK I made getEnum protected again.

        I had tentatively made it public so that one could create their own
        [external] rewrite methods. But I think (if we leave it protected),
        one could still make an inner/nested class that can access getEnum().

        Do we even need getFilter()? I removed it in the patch.
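
        The bulk-read optimization referred to above, sketched against the
        real TermDocs.read(int[], int[]) API (the enumerator and surrounding
        context are illustrative):

            OpenBitSet bits = new OpenBitSet(reader.maxDoc());
            int[] docs = new int[32];
            int[] freqs = new int[32];
            TermDocs termDocs = reader.termDocs();
            try {
              do {
                Term term = enumerator.term(); // the query's FilteredTermEnum
                if (term == null) break;
                termDocs.seek(term);
                int count;
                // read() fills up to 32 doc ids per call, instead of one
                // next()/doc() round-trip per posting:
                while ((count = termDocs.read(docs, freqs)) != 0) {
                  for (int i = 0; i < count; i++) {
                    bits.set(docs[i]);
                  }
                }
              } while (enumerator.next());
            } finally {
              termDocs.close();
            }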

        Uwe Schindler added a comment - edited

        Hi Mike,

        The patch looks good. I was a little bit confused by the high term-number cutoff, but it is using Math.max to limit it to the current BooleanQuery max clause count.

        Some small things:

        OK I made getEnum protected again.

        ...but only in MultiTermQuery itself. Everywhere else (even in the backwards-compatibility override test [JustCompile]) it is public. And the same should hold for incNumberOfTerms (it should also be protected). I think the rewrite method is internal to MultiTermQuery and always implemented in a subclass of MTQ as an inner class.

        Also, the current singletons are not really singletons, because queries that are deserialized will contain instances that are not the "singleton" instances - and will therefore fail hashCode/equals tests. The problem behind this: the singletons are serializable but do not return themselves from readResolve() (not implemented). All singletons that are serializable must implement readResolve and return the singleton instance (see the Parameter base class or the parser singletons in FieldCache).

        The instance in the default auto RewriteMethod is still modifiable, so one could change the defaults by setting properties on this instance. Is this intended?
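
        The readResolve fix described above, sketched (the class name is
        illustrative; the constant is the real singleton):

            // A serializable singleton must hand back the canonical instance
            // on deserialization, otherwise deserialized queries hold a
            // different instance and hashCode/equals comparisons fail.
            class ScoringBooleanQueryRewrite extends MultiTermQuery.RewriteMethod {
              public Query rewrite(IndexReader reader, MultiTermQuery query) throws IOException {
                return null; // actual BooleanQuery-building logic elided in this sketch
              }
              protected Object readResolve() {
                return MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
              }
            }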

        Michael McCandless added a comment -

        I was a little bit confused about the high term number cut off,

        Sorry I still need to do some perf testing to pick an appropriate
        default here.

        Everywhere else (even in the backwards compatibility override test [JustCompile] it is public). And the same should be for the incNumberOfTerms (also protected).

        Whoops – I'll fix. Thanks for catching these even though you're on "vacation"

        Also the current singletons are not really singletons, because queries that are unserialized will contain instances that are not the "singleton" instances

        Sigh. I'll do what FieldCache's parser singletons do.

        The instance in the default Auto RewriteMethod is still modifiable. Is this correct?

        I was thinking this was OK, ie, you could set the default cutoffs for
        anything that used the AUTO_DEFAULT. But it is static (global), so
        that's not great. I guess I'll make it an anonymous subclass of
        ConstantScoreAutoRewrite that disallows changes.

        Michael McCandless added a comment -

        New patch attached w/ above fixes plus some javadoc fixes. It has
        some nocommits which I'll clean up before committing.

        Michael McCandless added a comment -

        New patch attached. I think it's ready to commit!

        I ran a series of simple tests, using a 20 million doc Wikipedia
        index. I tested w/ PrefixQuery, using different prefixes to tickle
        different numbers of matching terms vs matching docs.

        I first pursued the "matches many terms but few docs" case, and found
        at around 350 terms the filter method becomes faster.

        Then, I fixed the number of terms at 350 (modified PrefixQuery to
        pretend the enum stopped there) and tested different numbers of "doc
        visit counts" (sum of docFreq of each term) and found ~0.1% (1/1000)
        of maxDoc() was the cutover.

        I also switched NumericRange* to use CONSTANT_SCORE_AUTO by default.
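
        A usage sketch of tuning those cutoffs via the expert setters on
        ConstantScoreAutoRewrite, with the defaults these tests arrived at
        (setter names assumed from the description above):

            MultiTermQuery.ConstantScoreAutoRewrite auto =
              new MultiTermQuery.ConstantScoreAutoRewrite();
            auto.setTermCountCutoff(350); // above ~350 terms, cut over to the filter
            auto.setDocCountPercent(0.1); // above 0.1% of maxDoc visited, likewise
            query.setRewriteMethod(auto); // query: any MultiTermQuery subclass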

        Uwe Schindler added a comment -

        Looks good, Mike.

        I think NumericRangeQuery should also be switched to auto mode, you are right. My perf test was a little bit unfair, because it used a 5 million doc index with random integers. The queries were also random, and the sum of matching docs over the index size was about 1/3 (because of the random queries). So most queries hit about one third of all docs, and in that case the filter is always faster. For very small ranges with few terms, it may be really good to use the BooleanQuery rewrite.

        A good thing would also be to set the mode to filter automatically if precisionStep > 6 for longs (valSize=64) and precisionStep > 8 for ints (valSize=32), because there the number of terms is often too big.

        One bug in ConstantScoreRangeQuery: you set the default to AUTO, but the method that prevents changing the mode is now wrong:

           /** Changes of mode are not supported by this class (fixed to constant score rewrite mode) */
        -  public void setConstantScoreRewrite(boolean constantScoreRewrite) {
        -    if (!constantScoreRewrite)
        -      throw new UnsupportedOperationException("Use TermRangeQuery instead to enable boolean query rewrite.");
        +  public void setRewriteMethod(RewriteMethod method) {
        +    if (method != CONSTANT_SCORE_FILTER_REWRITE) {
        +      throw new UnsupportedOperationException("Use TermRangeQuery instead to change the rewrite method.");
        +    }
           }
        

        I would change this to simply always throw UOE on any change in ConstantScoreRangeQuery.

        Uwe Schindler added a comment -

        Another question: maybe it would be good to change the FieldCache to also use the buffered TermDocs variant in the Uninverter code?
        Has anybody done perf tests here? It would be easy to change this.

        Michael McCandless added a comment -

        I would change this to simply always throw UOE on any change in ConstantScoreRangeQuery.

        OK will do.

        A good thing would also be to set the mode to filter automatically, if precisionStep >6 for longs (valSize=64) and precStep > 8 for ints (valSize=32), because here the number of terms is often too big.

        Will do.

        Hmm – my last patch hit this test failure, after I had switched to CONSTANT_SCORE_AUTO for NumericRangeQuery:

            [junit] NOTE: random seed of testcase 'testRandomTrieAndClassicRangeQuery_NoTrie' was: -2919237198484373178
            [junit] ------------- ---------------- ---------------
            [junit] Testcase: testRandomTrieAndClassicRangeQuery_NoTrie(org.apache.lucene.search.TestNumericRangeQuery32):	FAILED
            [junit] Total number of terms should be equal for unlimited precStep expected:<668552> but was:<668584>
            [junit] junit.framework.AssertionFailedError: Total number of terms should be equal for unlimited precStep expected:<668552> but was:<668584>
            [junit] 	at org.apache.lucene.search.TestNumericRangeQuery32.testRandomTrieAndClassicRangeQuery(TestNumericRangeQuery32.java:266)
            [junit] 	at org.apache.lucene.search.TestNumericRangeQuery32.testRandomTrieAndClassicRangeQuery_NoTrie(TestNumericRangeQuery32.java:287)
            [junit] 	at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
            [junit] 
        

        I haven't dug into it yet...

        Michael McCandless added a comment -

        Hmm - my last patch hit this test failure, after I had switched to CONSTANT_SCORE_AUTO for NumericRangeQuery:

        Heh – so I tried to repeat the failure, and couldn't. Then I remembered that nice random seed that was printed out, so I went to the test, wired in that seed, and BOOM, the failure happened. Thank you Hoss!

        I found the failure – I was just failing to increment the term count in the "auto uses BooleanQuery" case. New patch soon...

        Michael McCandless added a comment -

        New patch. I'll commit in a day or two, though I'll wait first for LUCENE-1693 to go in (it'll conflict on QueryParser.jj/java).

        Uwe Schindler added a comment -

        You were faster. The problem was the missing update of the term count. I first thought that you were maybe counting the terms twice (once at the beginning and then again in the filter, if the auto mode switches). But you are now only updating the term count at the end, when you build the boolean query.

        By the way: how about initializing the ArrayList not with the default size, but with maybe termCountCutoff/2 or something like that?

        Michael McCandless added a comment -

        How about initializing the ArrayList not with the default size, but maybe termCount cutoff/2 or something like that?

        That makes me a bit nervous, eg if someone sets these limits to something immense?

        Michael McCandless added a comment -

        Thanks Uwe!


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes:
            0
          • Watchers:
            0