Lucene - Core
  1. Lucene - Core
  2. LUCENE-1424

Change all multi-term querys so that they extend MultiTermQuery and allow for a constant score mode

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Cleans up a bunch of code duplication, closer to how things should be - design wise, gives us constant score for all the multi term queries, and allows us at least the option of highlighting the constant score queries without much further work.

      1. LUCENE-1424.patch
        19 kB
        Mark Miller
      2. LUCENE-1424.patch
        19 kB
        Mark Miller
      3. LUCENE-1424.patch
        22 kB
        Mark Miller
      4. LUCENE-1424.patch
        26 kB
        Mark Miller
      5. LUCENE-1424.patch
        39 kB
        Mark Miller
      6. LUCENE-1424.patch
        47 kB
        Mark Miller
      7. LUCENE-1424.patch
        111 kB
        Mark Miller
      8. LUCENE-1424.patch
        114 kB
        Mark Miller
      9. LUCENE-1424.patch
        99 kB
        Mark Miller
      10. LUCENE-1424.patch
        107 kB
        Michael McCandless
      11. LUCENE-1424.patch
        116 kB
        Mark Miller
      12. LUCENE-1424.patch
        128 kB
        Michael McCandless
      13. LUCENE-1424.patch
        128 kB
        Mark Miller
      14. lucene-1424.patch
        4 kB
        Michael Busch
      15. LUCENE-1424-dep_rng_cstr_fix.patch
        2 kB
        Mark Miller

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          Thanks Mark – I just committed that.

          Show
          Michael McCandless added a comment - Thanks Mark – I just committed that.
          Hide
          Mark Miller added a comment -

          Fixes the null problem with deprecated RangeQuery constructors and adds unit test for it.

          Show
          Mark Miller added a comment - Fixes the null problem with deprecated RangeQuery constructors and adds unit test for it.
          Hide
          Mark Miller added a comment -

          Another one: looks like as we were iterating back and forth, after I changed the constructors to the non deprecated, we lost the ability of the deprecated constructors to accept null properly. Doh! I'll try and fix it tonight.

          Show
          Mark Miller added a comment - Another one: looks like as we were iterating back and forth, after I changed the constructors to the non deprecated, we lost the ability of the deprecated constructors to accept null properly. Doh! I'll try and fix it tonight.
          Hide
          Michael Busch added a comment -

          Committed revision 713225.

          Show
          Michael Busch added a comment - Committed revision 713225.
          Hide
          Michael Busch added a comment -

          Here's the patch.
          'ant test test-tag' passes now.

          I'll commit shortly.

          Show
          Michael Busch added a comment - Here's the patch. 'ant test test-tag' passes now. I'll commit shortly.
          Hide
          Mark Miller added a comment -

          For kicks, I tried implementing the MultiTermGenerator so that it returns an inner class DocSetId iterator that iterates over matching terms/docs, thinking we could save some speed by avoiding the OpenBitSet creation, population, read. Instead, even in tests that didn't involve any skip-to (not sure what/if any of these cases actually do, still wrapping my head around that), the non generate the bitset first approach was a tad slower rather than faster. Don't know exactly why I the moment, but thought I'd chronicle the attempt for future optimizers. Probably all the stuff that happens between next calls slows down the enumeration quite a bit, so that doing it all at once saves enough time to make up for the OpenBitSet stuff.

          Show
          Mark Miller added a comment - For kicks, I tried implementing the MultiTermGenerator so that it returns an inner class DocSetId iterator that iterates over matching terms/docs, thinking we could save some speed by avoiding the OpenBitSet creation, population, read. Instead, even in tests that didn't involve any skip-to (not sure what/if any of these cases actually do, still wrapping my head around that), the non generate the bitset first approach was a tad slower rather than faster. Don't know exactly why I the moment, but thought I'd chronicle the attempt for future optimizers. Probably all the stuff that happens between next calls slows down the enumeration quite a bit, so that doing it all at once saves enough time to make up for the OpenBitSet stuff.
          Hide
          Michael McCandless added a comment -

          Thanks Michael!

          Show
          Michael McCandless added a comment - Thanks Michael!
          Hide
          Michael Busch added a comment -

          Thanks Mark.

          OK, I'll make a patch and commit it later today.

          Show
          Michael Busch added a comment - Thanks Mark. OK, I'll make a patch and commit it later today.
          Hide
          Mark Miller added a comment -

          Looks like a fine solution to me. Cool tests.

          Show
          Mark Miller added a comment - Looks like a fine solution to me. Cool tests.
          Hide
          Michael Busch added a comment -

          Yeah, I understand the reason... and easy "fix" would be in QueryParser#newRangeQuery():

            protected Query newRangeQuery(String field, String part1, String part2, boolean inclusive) {
              RangeQuery query;  
            
              if (constantScoreRewrite) {
                // TODO: remove in Lucene 3.0
                query = new ConstantScoreRangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator);
              } else {
                query = new RangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator);
              }
              query.setConstantScoreRewrite(constantScoreRewrite);
              return query;
            }
          

          Since ConstantScoreRangeQuery extends RangeQuery this works and should not change any behavior, right?
          I tried it out, "ant test-core test-tag" passes with this change.

          Show
          Michael Busch added a comment - Yeah, I understand the reason... and easy "fix" would be in QueryParser#newRangeQuery(): protected Query newRangeQuery( String field, String part1, String part2, boolean inclusive) { RangeQuery query; if (constantScoreRewrite) { // TODO: remove in Lucene 3.0 query = new ConstantScoreRangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator); } else { query = new RangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator); } query.setConstantScoreRewrite(constantScoreRewrite); return query; } Since ConstantScoreRangeQuery extends RangeQuery this works and should not change any behavior, right? I tried it out, "ant test-core test-tag" passes with this change.
          Hide
          Mark Miller added a comment -

          Now it returns a RangeQuery that is set for constant score mode. Its
          behavior is back compatible with ConstantScoreRangeQuery, but we no
          longer return it...

          Show
          Mark Miller added a comment - Now it returns a RangeQuery that is set for constant score mode. Its behavior is back compatible with ConstantScoreRangeQuery, but we no longer return it...
          Hide
          Michael Busch added a comment -

          Hmm the culprit is this line in 2.4's TestQueryParser.testRange():

          assertTrue(getQuery("[ a TO z]", null) instanceof ConstantScoreRangeQuery);

          Show
          Michael Busch added a comment - Hmm the culprit is this line in 2.4's TestQueryParser.testRange(): assertTrue(getQuery("[ a TO z]", null) instanceof ConstantScoreRangeQuery);
          Hide
          Michael Busch added a comment -

          Since this was committed I get a test failure when I run
          'ant -Dtag=lucene_2_4_0 test-tag'

          [junit] Testcase: testRange(org.apache.lucene.queryParser.TestQueryParser):FAILED
          [junit] null
          [junit] junit.framework.AssertionFailedError
          [junit] at org.apache.lucene.queryParser.TestQueryParser.testRange(TestQueryParser.java:418)

          Seems like this might have broken backwards-compatibility. I haven't looked into the reason yet.

          Show
          Michael Busch added a comment - Since this was committed I get a test failure when I run 'ant -Dtag=lucene_2_4_0 test-tag' [junit] Testcase: testRange(org.apache.lucene.queryParser.TestQueryParser):FAILED [junit] null [junit] junit.framework.AssertionFailedError [junit] at org.apache.lucene.queryParser.TestQueryParser.testRange(TestQueryParser.java:418) Seems like this might have broken backwards-compatibility. I haven't looked into the reason yet.
          Hide
          Michael McCandless added a comment -

          Committed revision 712890.

          Thanks Mark!

          Show
          Michael McCandless added a comment - Committed revision 712890. Thanks Mark!
          Hide
          Michael McCandless added a comment -

          OK this version looks good! I plan to commit sometime later today... thanks Mark!

          Show
          Michael McCandless added a comment - OK this version looks good! I plan to commit sometime later today... thanks Mark!
          Hide
          Mark Miller added a comment - - edited

          looks great!

          Sorry to do this, but I had to touch up the javadoc on RangeQuery as it no longer requires at least one term. I promise thats my last!

          Oh yeah, and co credit would probably be more accurate.

          Show
          Mark Miller added a comment - - edited looks great! Sorry to do this, but I had to touch up the javadoc on RangeQuery as it no longer requires at least one term. I promise thats my last! Oh yeah, and co credit would probably be more accurate.
          Hide
          Michael McCandless added a comment -

          Excellent! OK I made a few more changes, new patch attached. This is like playing ping-pong :

          • Fixed QueryParser.jj to match our changes to QueryParser.java
          • Fixed a few more unused imports, unused assignments, deprecated
            calls, javadocs, etc
          Show
          Michael McCandless added a comment - Excellent! OK I made a few more changes, new patch attached. This is like playing ping-pong : Fixed QueryParser.jj to match our changes to QueryParser.java Fixed a few more unused imports, unused assignments, deprecated calls, javadocs, etc
          Hide
          Mark Miller added a comment -

          +1 on all of your changes.

          Still had an issue with your patch with the two $id classes - I have no clue whats up with that, so I just manually paste the expanded line into my patch.

          Removed some unused imports in new/modified classes and further added/tweaked appropriate javadoc.

          Changed RangeQuery tests to use non deprecated constructors, exposed a null pointer issue with RangeQuery.equals and hashcode, fixed.

          Show
          Mark Miller added a comment - +1 on all of your changes. Still had an issue with your patch with the two $id classes - I have no clue whats up with that, so I just manually paste the expanded line into my patch. Removed some unused imports in new/modified classes and further added/tweaked appropriate javadoc. Changed RangeQuery tests to use non deprecated constructors, exposed a null pointer issue with RangeQuery.equals and hashcode, fixed.
          Hide
          Michael McCandless added a comment -

          OK I made some changes to the patch – I think we are real close now:

          • Overrode WildcardQuery.rewrite to return TermQuery if the pattern has
            no wildcards (your suggestion above).
          • Simplified MultiTermFilter: made it a static inner class, moved
            TermGenerator inside it (absorbed IdGenerator interface into it),
            move bits() and getDocIdSet() out of the anonymous class,
            simplified the equals method.
          • Removed extra TermGenerator class from PrefixQuery.
          • Fixed but RangeFilter's ctor – I think you could have gotten null
            rangeQuery if you pass in null collator.
          • Various small javadoc fixes, renaming, refactoring.
          • Deprecated ConstantScoreRangeQuery in favor of RangeQuery.
          • Overrode FuzzyQuery.setConstantScoreRewrite to throw
            UnsupportedOperationException, because it overrides rewrite.
          • Added set/getConstantScoreRewrite to QueryParser (defaults to
            true) and deprecated set/getUseOldRangeQuery.
          • Deprecated the ctor's of RangeQuery that take Term (in favor of
            String versions); removed new ctors in RangeQuery that take Term
            for upper/lower val.

          Mark can you look this over and see if the changes are OK? Thanks.

          Show
          Michael McCandless added a comment - OK I made some changes to the patch – I think we are real close now: Overrode WildcardQuery.rewrite to return TermQuery if the pattern has no wildcards (your suggestion above). Simplified MultiTermFilter: made it a static inner class, moved TermGenerator inside it (absorbed IdGenerator interface into it), move bits() and getDocIdSet() out of the anonymous class, simplified the equals method. Removed extra TermGenerator class from PrefixQuery. Fixed but RangeFilter's ctor – I think you could have gotten null rangeQuery if you pass in null collator. Various small javadoc fixes, renaming, refactoring. Deprecated ConstantScoreRangeQuery in favor of RangeQuery. Overrode FuzzyQuery.setConstantScoreRewrite to throw UnsupportedOperationException, because it overrides rewrite. Added set/getConstantScoreRewrite to QueryParser (defaults to true) and deprecated set/getUseOldRangeQuery. Deprecated the ctor's of RangeQuery that take Term (in favor of String versions); removed new ctors in RangeQuery that take Term for upper/lower val. Mark can you look this over and see if the changes are OK? Thanks.
          Hide
          Mark Miller added a comment -

          It looks like termContainsWildcard; and the check code can come out of wildcardquery. I pretty much got around that by making wc enum not blow up when the term doesn't contain a wildcard. There are other ways to do it that would keep the check if we prefer (a simple override of rewrite on wildcardquery would be the other option I like). I just liked the idea of WCTermEnum not blowing up like WCQuery doesnt blow up.

          Show
          Mark Miller added a comment - It looks like termContainsWildcard; and the check code can come out of wildcardquery. I pretty much got around that by making wc enum not blow up when the term doesn't contain a wildcard. There are other ways to do it that would keep the check if we prefer (a simple override of rewrite on wildcardquery would be the other option I like). I just liked the idea of WCTermEnum not blowing up like WCQuery doesnt blow up.
          Hide
          Michael McCandless added a comment -

          Excellent – patch applies perfectly and all tests pass. I'll review the changes. Thanks Mark!

          Show
          Michael McCandless added a comment - Excellent – patch applies perfectly and all tests pass. I'll review the changes. Thanks Mark!
          Hide
          Mark Miller added a comment -

          This time granting license to ASF

          Show
          Mark Miller added a comment - This time granting license to ASF
          Hide
          Mark Miller added a comment -

          Okay, fingers crossed, this is good.

          Eclipse was displaying the $id correctly, but it wasn't going out in the patch. Also, I didn't realize refactoring can cause issues like that, so I deleted and created a new class. Personally, I think thats fine, but let me know if you disagree. I also tested applying the patch to a fresh checkout and ran the tests. Should be good.

          I think I'm done unless (probably when) you prompt something else / something I missed.

          Show
          Mark Miller added a comment - Okay, fingers crossed, this is good. Eclipse was displaying the $id correctly, but it wasn't going out in the patch. Also, I didn't realize refactoring can cause issues like that, so I deleted and created a new class. Personally, I think thats fine, but let me know if you disagree. I also tested applying the patch to a fresh checkout and ran the tests. Should be good. I think I'm done unless (probably when) you prompt something else / something I missed.
          Hide
          Michael McCandless added a comment -

          Sorry about all the patches by the way.

          Whoa, no need to apologize – I in fact love all the patches and fast feedback/iterations. That's definitely the right way to do it. When I said "...but I sure hope this is the last patch" I meant "if I need to manually fix the patch failures". Since you are fixing the patch problems (thanks!) then it's all good – keep the patch iterations going.

          Show
          Michael McCandless added a comment - Sorry about all the patches by the way. Whoa, no need to apologize – I in fact love all the patches and fast feedback/iterations. That's definitely the right way to do it. When I said "...but I sure hope this is the last patch" I meant "if I need to manually fix the patch failures". Since you are fixing the patch problems (thanks!) then it's all good – keep the patch iterations going.
          Hide
          Mark Miller added a comment - - edited

          Don't work through it. I'm just using eclipse And did nothing
          different, but don't waste your time. Let me figure it out. I'll make
          sure it works without eclipse before putting it up.

          That test was just an eclipse name refactor.

          Sorry about all the patches by the way. Tends to be my style I tried to indicate which were merely progress patches (if someone has the interest, it allows my ,any wrong directions to be pointed out sooner) and which were something that should be considered more final review worthy. I am sure I can do a better job on that though. Certainly not my intention to waste anyones time, especially when that time could be better spent improving Lucene

          • Mark

          On Nov 8, 2008, at 12:34 PM, "Michael McCandless (JIRA)" <jira@apache.org

          Show
          Mark Miller added a comment - - edited Don't work through it. I'm just using eclipse And did nothing different, but don't waste your time. Let me figure it out. I'll make sure it works without eclipse before putting it up. That test was just an eclipse name refactor. Sorry about all the patches by the way. Tends to be my style I tried to indicate which were merely progress patches (if someone has the interest, it allows my ,any wrong directions to be pointed out sooner) and which were something that should be considered more final review worthy. I am sure I can do a better job on that though. Certainly not my intention to waste anyones time, especially when that time could be better spent improving Lucene Mark On Nov 8, 2008, at 12:34 PM, "Michael McCandless (JIRA)" <jira@apache.org
          Hide
          Michael McCandless added a comment -

          I'm having trouble applying the latest patch – I get failed Hunks, and, patch can't find one of the source files:

          patching file contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java
          patching file src/java/org/apache/lucene/queryParser/QueryParser.java
          patching file src/java/org/apache/lucene/search/ConstantScoreRangeQuery.java
          Hunk #2 FAILED at 29.
          1 out of 2 hunks FAILED -- saving rejects to file src/java/org/apache/lucene/search/ConstantScoreRangeQuery.java.rej
          patching file src/java/org/apache/lucene/search/MultiTermQuery.java
          patching file src/java/org/apache/lucene/search/PrefixFilter.java
          patching file src/java/org/apache/lucene/search/PrefixQuery.java
          patching file src/java/org/apache/lucene/search/PrefixTermEnum.java
          patching file src/java/org/apache/lucene/search/RangeFilter.java
          patching file src/java/org/apache/lucene/search/RangeQuery.java
          Hunk #1 FAILED at 21.
          1 out of 3 hunks FAILED -- saving rejects to file src/java/org/apache/lucene/search/RangeQuery.java.rej
          patching file src/java/org/apache/lucene/search/RangeTermEnum.java
          patching file src/java/org/apache/lucene/search/WildcardQuery.java
          patching file src/java/org/apache/lucene/search/WildcardTermEnum.java
          patching file src/test/org/apache/lucene/queryParser/TestQueryParser.java
          patching file src/test/org/apache/lucene/search/TestConstantScoreRangeQuery.java
          can't find file to patch at input line 1952
          Perhaps you used the wrong -p or --strip option?
          The text leading up to this was:
          --------------------------
          |Index: src/test/org/apache/lucene/search/TestMultiTermConstantScore.java
          |===================================================================
          |--- src/test/org/apache/lucene/search/TestMultiTermConstantScore.java	(revision 710055)
          |+++ src/test/org/apache/lucene/search/TestMultiTermConstantScore.java	(working copy)
          --------------------------
          File to patch: 
          

          Did you turn off keyword expansion in your local svn checkout? That seems to explain the first to failures (if I manually undo the $Id$ keyword expansion then the patch applies cleanly).

          On the 3rd failure, it seems like you renamed TestConstantScoreRangeQuery.java to TestMultiTermConstantScore.java, and then modified it? This confuses my patch client.

          I think I can work through these, but I sure hope this is the last patch

          Show
          Michael McCandless added a comment - I'm having trouble applying the latest patch – I get failed Hunks, and, patch can't find one of the source files: patching file contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java patching file src/java/org/apache/lucene/queryParser/QueryParser.java patching file src/java/org/apache/lucene/search/ConstantScoreRangeQuery.java Hunk #2 FAILED at 29. 1 out of 2 hunks FAILED -- saving rejects to file src/java/org/apache/lucene/search/ConstantScoreRangeQuery.java.rej patching file src/java/org/apache/lucene/search/MultiTermQuery.java patching file src/java/org/apache/lucene/search/PrefixFilter.java patching file src/java/org/apache/lucene/search/PrefixQuery.java patching file src/java/org/apache/lucene/search/PrefixTermEnum.java patching file src/java/org/apache/lucene/search/RangeFilter.java patching file src/java/org/apache/lucene/search/RangeQuery.java Hunk #1 FAILED at 21. 1 out of 3 hunks FAILED -- saving rejects to file src/java/org/apache/lucene/search/RangeQuery.java.rej patching file src/java/org/apache/lucene/search/RangeTermEnum.java patching file src/java/org/apache/lucene/search/WildcardQuery.java patching file src/java/org/apache/lucene/search/WildcardTermEnum.java patching file src/test/org/apache/lucene/queryParser/TestQueryParser.java patching file src/test/org/apache/lucene/search/TestConstantScoreRangeQuery.java can't find file to patch at input line 1952 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- |Index: src/test/org/apache/lucene/search/TestMultiTermConstantScore.java |=================================================================== |--- src/test/org/apache/lucene/search/TestMultiTermConstantScore.java (revision 710055) |+++ src/test/org/apache/lucene/search/TestMultiTermConstantScore.java (working copy) -------------------------- File to patch: Did you turn off keyword expansion in your local svn checkout? That seems to explain the first to failures (if I manually undo the $Id$ keyword expansion then the patch applies cleanly). On the 3rd failure, it seems like you renamed TestConstantScoreRangeQuery.java to TestMultiTermConstantScore.java, and then modified it? This confuses my patch client. I think I can work through these, but I sure hope this is the last patch
          Hide
          Mark Miller added a comment -

          Okay, this should be good. Thanks for the guidance.

          Show
          Mark Miller added a comment - Okay, this should be good. Thanks for the guidance.
          Hide
          Mark Miller added a comment -

          Okay, almost there. Making constantscorerange query extend rangequery exposed a small problem though - there is scoring difference when using multiserver vs single searcher. The simple rangequery test in the multisearcher scoring test fails. The single searcher returns a score of 1 and the multi 1.4 - still trying to figure out how the heck that is happening...there is not a lot going on, so its prob obvious, but not sticking out at the moment...

          Show
          Mark Miller added a comment - Okay, almost there. Making constantscorerange query extend rangequery exposed a small problem though - there is scoring difference when using multiserver vs single searcher. The simple rangequery test in the multisearcher scoring test fails. The single searcher returns a score of 1 and the multi 1.4 - still trying to figure out how the heck that is happening...there is not a lot going on, so its prob obvious, but not sticking out at the moment...
          Hide
          Michael McCandless added a comment -

          Looks good! I think we're almost done here...

          Shouldn't we deprecate ConstantScoreRangeQuery, and maybe change it to
          simply subclass RangeQuery and set useConstantScoreRewrite to true?

          Should we match the ctor of ConstantScoreRangeQuery (field, lower, upper,
          inclLower, inclUpper) with RangeQuery? Right now you have Term for
          lower & upper, not String.

          Can we rename MultiTermQuery.isUseConstantScoreRewrite() to
          getUseConstantScoreRewrite()? Or maybe set/getConstantScoreRewrite()?

          There seems to be a leftover "abstract class TermGenerator implements
          IdGenerator" at the bottom of Prefixquery.java.

          Small whitespace issue: you need to insert a space in "if(..." in a
          few places.

          Show
          Michael McCandless added a comment - Looks good! I think we're almost done here... Shouldn't we deprecate ConstantScoreRangeQuery, and maybe change it to simply subclass RangeQuery and set useConstantScoreRewrite to true? Should we match the ctor of ConstantScoreRangeQuery (field, lower, upper, inclLower, inclUpper) with RangeQuery? Right now you have Term for lower & upper, not String. Can we rename MultiTermQuery.isUseConstantScoreRewrite() to getUseConstantScoreRewrite()? Or maybe set/getConstantScoreRewrite()? There seems to be a leftover "abstract class TermGenerator implements IdGenerator" at the bottom of Prefixquery.java. Small whitespace issue: you need to insert a space in "if(..." in a few places.
          Hide
          Mark Miller added a comment -

          Ahhhh...satisfying bringing order to all of that. It kaleidoscopes up quite nicely. Thanks for the help Michael.

          Back to just needing:

          review
          possibly some comments/comment changes
          tests for the constantscore side of the queries

          Show
          Mark Miller added a comment - Ahhhh...satisfying bringing order to all of that. It kaleidoscopes up quite nicely. Thanks for the help Michael. Back to just needing: review possibly some comments/comment changes tests for the constantscore side of the queries
          Hide
          Mark Miller added a comment -

          No worries, I'm still in the thick of it - plenty of little strings to
          wrap up, just wanted to make sure I was on the right track.

          Show
          Mark Miller added a comment - No worries, I'm still in the thick of it - plenty of little strings to wrap up, just wanted to make sure I was on the right track.
          Hide
          Michael McCandless added a comment -

          Excellent!

          Looks like PrefixGenerator got lost. Why not have it just get its bits by creating a PrefixQuery and calling getFilter().getDocIdSet()?

          WildcardFilter is still in the patch.

          Maybe fix getDocIdSet to not call bits() (since that's 2-pass, it's slower). In fact, does that Filter even need to implement bits? Can we throw a not implemented exception, since it's a new Filter?

          Show
          Michael McCandless added a comment - Excellent! Looks like PrefixGenerator got lost. Why not have it just get its bits by creating a PrefixQuery and calling getFilter().getDocIdSet()? WildcardFilter is still in the patch. Maybe fix getDocIdSet to not call bits() (since that's 2-pass, it's slower). In fact, does that Filter even need to implement bits? Can we throw a not implemented exception, since it's a new Filter?
          Hide
          Mark Miller added a comment -

          bq In favor of new QueryWrapperFilter(new XXXQuery(...))?

          I didn't end up going with the wrapper. Check out this patch just so we get back on the same page sooner rather than later...does this look like a doable way of doing it? It kills all the dupe code I think.

          We can just leave Prefixfilter since its their anyway, but the prefixquery wont use it.

          Show
          Mark Miller added a comment - bq In favor of new QueryWrapperFilter(new XXXQuery(...))? I didn't end up going with the wrapper. Check out this patch just so we get back on the same page sooner rather than later...does this look like a doable way of doing it? It kills all the dupe code I think. We can just leave Prefixfilter since its their anyway, but the prefixquery wont use it.
          Hide
          Mark Miller added a comment -

          I think we should indeed "factor up" into MultiTermQuery...

          Ah, I see...I hadn't thought all way up that path - it goes a bit further than was in my mind. The way to go for sure.

          All the comments make perfect sense, I'll work up a new patch while I'm at apachcon.

          Show
          Mark Miller added a comment - I think we should indeed "factor up" into MultiTermQuery... Ah, I see...I hadn't thought all way up that path - it goes a bit further than was in my mind. The way to go for sure. All the comments make perfect sense, I'll work up a new patch while I'm at apachcon.
          Hide
          Michael McCandless added a comment -

          Patch looks good! Thanks Mark.

          I think we should indeed "factor up" into MultiTermQuery the ability to create a ConstantScoreQuery out of the filter generated by enumerating the terms and walking the docs for those terms? Then we don't need to special case in each of the subclasses.

          Maybe we can then fix Wildcard/Prefix/RangeFilter to create the corresponding query and then ask it for its filter (assuming we make a method eg "getDocIdSet" in MultiTermQuery)? Or I guess we could just make a QueryWrapperFilter around the corresponding query, though that seems rather roundabout. I don't think we need to have duplicated code in PrefixFilter, RangeFilter, WildcardFilter.

          When I was talking about deprecating constantscorequery being awkward, its wasnt really in the implementation sense (i think we can leave it as is), but more the deprecating one of the newest queries already Still don't consider that a huge deal though.

          Well ... this is just how software evolves You can't control which code will be the "victim" of a nice refactoring. It's a healthy sign of progress, and progress is good!

          rangequery is not as expressive as constantscorequery, which can have separate inclusive/exclusive ends.

          If we do deprecate ConstantScoreRangeQuery (I think we should) then we could add a ctor to RangeQuery matching ConstantScoreRangeQuery's more expressive one?

          Show
          Michael McCandless added a comment - Patch looks good! Thanks Mark. I think we should indeed "factor up" into MultiTermQuery the ability to create a ConstantScoreQuery out of the filter generated by enumerating the terms and walking the docs for those terms? Then we don't need to special case in each of the subclasses. Maybe we can then fix Wildcard/Prefix/RangeFilter to create the corresponding query and then ask it for its filter (assuming we make a method eg "getDocIdSet" in MultiTermQuery)? Or I guess we could just make a QueryWrapperFilter around the corresponding query, though that seems rather roundabout. I don't think we need to have duplicated code in PrefixFilter, RangeFilter, WildcardFilter. When I was talking about deprecating constantscorequery being awkward, its wasnt really in the implementation sense (i think we can leave it as is), but more the deprecating one of the newest queries already Still don't consider that a huge deal though. Well ... this is just how software evolves You can't control which code will be the "victim" of a nice refactoring. It's a healthy sign of progress, and progress is good! rangequery is not as expressive as constantscorequery, which can have separate inclusive/exclusive ends. If we do deprecate ConstantScoreRangeQuery (I think we should) then we could add a ctor to RangeQuery matching ConstantScoreRangeQuery's more expressive one?
          Hide
          Mark Miller added a comment - - edited

          Heres an early review patch.

          I switched range and prefix to use multiterm query, but its debatable if thats necessary. Prob a few clock cycles slower, and the benefits are slim beyond overall code design niceness (which counts a lot to me <g>). Just not sure every mutlitermquery would want to have to implement a constantscore version, and the amount of code reuse saved is low (it could do a bit more with something like getConstantScoreQuery, but thats not much either - I may put it in come to think of it).

          TODO:

          look over
          some comments maybe
          some tests for the constant score versions
          rangequery is not as expressive as constantscorequery, which can have separate inclusive/exclusive ends.

          When I was talking about deprecating constantscorequery being awkward, its wasnt really in the implementation sense (i think we can leave it as is), but more the deprecating one of the newest queries already Still don't consider that a huge deal though.

          Show
          Mark Miller added a comment - - edited Heres an early review patch. I switched range and prefix to use multiterm query, but its debatable if thats necessary. Prob a few clock cycles slower, and the benefits are slim beyond overall code design niceness (which counts a lot to me <g>). Just not sure every mutlitermquery would want to have to implement a constantscore version, and the amount of code reuse saved is low (it could do a bit more with something like getConstantScoreQuery, but thats not much either - I may put it in come to think of it). TODO: look over some comments maybe some tests for the constant score versions rangequery is not as expressive as constantscorequery, which can have separate inclusive/exclusive ends. When I was talking about deprecating constantscorequery being awkward, its wasnt really in the implementation sense (i think we can leave it as is), but more the deprecating one of the newest queries already Still don't consider that a huge deal though.
          Hide
          Michael McCandless added a comment -

          OK, it sounds like we should preserve the rewrite-to-BooleanQuery
          option for each of these Query classes since there are clear use cases
          where it makes sense.

          I do like the idea of adding "constant score capability" to the
          existing query classes, instead of adding new ConstantScoreXXXQuery
          classes.

          I don't really like the REWRITE_GUESS option because it could lead to
          strange inconsistent results as seen by the end user – eg prefix1
          sorts one way but then a relaxed prefix2 sorts another way. I think a
          simple static binary choice is good?

          Couldn't we build this functionality (get/setRewriteMethod(...)) into
          MultiTermQuery? And then fix those queries that don't already to
          subclass MultiTermQuery (at least RangeQuery and PrefixQuery)?

          Finally, I would also prefer non-static methods to instruct
          QueryParser which rewrite method to use. In fact we already have
          setUseOldRangeQuery – maybe change this to
          setUseConstantScoreRewriteMethod (defaulting to true)?

          Show
          Michael McCandless added a comment - OK, it sounds like we should preserve the rewrite-to-BooleanQuery option for each of these Query classes since there are clear use cases where it makes sense. I do like the idea of adding "constant score capability" to the existing query classes, instead of adding new ConstantScoreXXXQuery classes. I don't really like the REWRITE_GUESS option because it could lead to strange inconsistent results as seen by the end user – eg prefix1 sorts one way but then a relaxed prefix2 sorts another way. I think a simple static binary choice is good? Couldn't we build this functionality (get/setRewriteMethod(...)) into MultiTermQuery? And then fix those queries that don't already to subclass MultiTermQuery (at least RangeQuery and PrefixQuery)? Finally, I would also prefer non-static methods to instruct QueryParser which rewrite method to use. In fact we already have setUseOldRangeQuery – maybe change this to setUseConstantScoreRewriteMethod (defaulting to true)?
          Hide
          Mark Miller added a comment -

          Sorry bout the @overrides...grabbed/copied from solr and forgot the java 1.4 cleanse. There was also one StringBuilder in the wildcardfilter that this patch pulls.

          Show
          Mark Miller added a comment - Sorry bout the @overrides...grabbed/copied from solr and forgot the java 1.4 cleanse. There was also one StringBuilder in the wildcardfilter that this patch pulls.
          Hide
          Hoss Man added a comment -

          Is there any reason to keep the rewrite-to-BooleanQuery

          In addition to the coord example Mark mentioned, there are also cases where the tf and idf values for terms matching a wildcard/prefix are meaningful ... but the other advantage to keeping the existing implementations is use cases where the number of unique terms a query expands to is is very small, but the number of documents matched is very large ... in theory (but i haven't tested this) the expanding queries should be more efficient in space and equally efficient in time as the ConstantScore equivalents.

          It would then be possible to just switch the QueryParser or other code by e.g. setting a global (static) flag.

          I'm loath to see more static "setter" methods added to query clauses, but there's no reason this couldn't be a member property on instances of the classes with corresponding properties on QueryParser.

          In theory it could even be a tertiary state property: REWRITE_TO_BOOLEAN, REWRITE_TO_CONSTANT_SCORE. REWRITE_GUESS ... where the third option caused the REWRITE method to make it's best guess using a first pass at the TermEnum iteration and aborting if the number of terms get's above some threshold.

          (but that kind of "optimization" would be premature without testing .. the point is it would be possible)

          Show
          Hoss Man added a comment - Is there any reason to keep the rewrite-to-BooleanQuery In addition to the coord example Mark mentioned, there are also cases where the tf and idf values for terms matching a wildcard/prefix are meaningful ... but the other advantage to keeping the existing implementations is use cases where the number of unique terms a query expands to is is very small, but the number of documents matched is very large ... in theory (but i haven't tested this) the expanding queries should be more efficient in space and equally efficient in time as the ConstantScore equivalents. It would then be possible to just switch the QueryParser or other code by e.g. setting a global (static) flag. I'm loath to see more static "setter" methods added to query clauses, but there's no reason this couldn't be a member property on instances of the classes with corresponding properties on QueryParser. In theory it could even be a tertiary state property: REWRITE_TO_BOOLEAN, REWRITE_TO_CONSTANT_SCORE. REWRITE_GUESS ... where the third option caused the REWRITE method to make it's best guess using a first pass at the TermEnum iteration and aborting if the number of terms get's above some threshold. (but that kind of "optimization" would be premature without testing .. the point is it would be possible)
          Hide
          Uwe Schindler added a comment -

          Just a suggestion:
          How about not inventing new Class names. Just include both code parts into e.g. (Range|Prefix|Wildcard)Query's rewrite() method. It would then be possible to just switch the QueryParser or other code by e.g. setting a global (static) flag in (Range|Prefix|Wildcard)Query that switches between both implementations).

          To your patch:
          For printing out the boost in toString(): there is a nice helper routine (instead of "if (getBoost()!=1.0f)...."). I think it should be consistent with toString() of all other core query types.

          Show
          Uwe Schindler added a comment - Just a suggestion: How about not inventing new Class names. Just include both code parts into e.g. (Range|Prefix|Wildcard)Query's rewrite() method. It would then be possible to just switch the QueryParser or other code by e.g. setting a global (static) flag in (Range|Prefix|Wildcard)Query that switches between both implementations). To your patch: For printing out the boost in toString(): there is a nice helper routine (instead of "if (getBoost()!=1.0f)...."). I think it should be consistent with toString() of all other core query types.
          Hide
          Mark Harwood added a comment -

          >> Are the score differences caused by the rewrite-to-BooleanQuery implementations ever "useful"?

          So we need to consider what we are losing - TF, IDF, coordination, length norm, doc boosts.

          I can only think of one use case which relates to coordination factor.

          If you have a "category" field for a product e.g. given Lucene docs for these books:

          Title: Lucene in Action
          Category: /Books/Computing/Languages/Java
          /Books/Computing/InformationRetrieval

          Title: The Long Tail
          Category: /Books/Business/Internet
          /Books/Computing

          You might then use a wildcard search of /Books/Computing/* and "Lucene in Action" would rank higher than "The Long Tail" because a BooleanQuery would score a higher coordination factor suggesting LIA got more hits under this "/Books/Computing.." category. There would still be the issue of IDF potentially skewing results but the coordination factor is potentially useful here.

          I think in general IDF tends to be useless for "auto-expanded" terms e.g. Wildcard, fuzzy etc. Incidentally, we still see that IDF issue in fuzzy queries ranking rare mis-spellings higher but that's another issue (one I resolved in contrib's FuzzyLikeThisQuery).

          I suppose one other consideration is for people who have created any doc boosts e.g. trying to use this to boost by date.

          I don't think any of these cases necessarily outweigh the benefit to be obtained from switching "wildcard/prefix to constant score queries"

          Cheers,
          Mark

          Show
          Mark Harwood added a comment - >> Are the score differences caused by the rewrite-to-BooleanQuery implementations ever "useful"? So we need to consider what we are losing - TF, IDF, coordination, length norm, doc boosts. I can only think of one use case which relates to coordination factor. If you have a "category" field for a product e.g. given Lucene docs for these books: Title: Lucene in Action Category: /Books/Computing/Languages/Java /Books/Computing/InformationRetrieval Title: The Long Tail Category: /Books/Business/Internet /Books/Computing You might then use a wildcard search of /Books/Computing/* and "Lucene in Action" would rank higher than "The Long Tail" because a BooleanQuery would score a higher coordination factor suggesting LIA got more hits under this "/Books/Computing.." category. There would still be the issue of IDF potentially skewing results but the coordination factor is potentially useful here. I think in general IDF tends to be useless for "auto-expanded" terms e.g. Wildcard, fuzzy etc. Incidentally, we still see that IDF issue in fuzzy queries ranking rare mis-spellings higher but that's another issue (one I resolved in contrib's FuzzyLikeThisQuery). I suppose one other consideration is for people who have created any doc boosts e.g. trying to use this to boost by date. I don't think any of these cases necessarily outweigh the benefit to be obtained from switching "wildcard/prefix to constant score queries" Cheers, Mark
          Hide
          Michael McCandless added a comment -

          Patch looks good. I plan to commit in a day or two.

          Show
          Michael McCandless added a comment - Patch looks good. I plan to commit in a day or two.
          Hide
          Michael McCandless added a comment -

          Woops... patch uses @Override (Java 5 only):

          common.compile-core:
              [mkdir] Created dir: /lucene/lucene.constant/build/classes/java
              [javac] Compiling 333 source files to /lucene/lucene.constant/build/classes/java
              [javac] /lucene/lucene.constant/src/java/org/apache/lucene/search/ConstantScoreWildcardQuery.java:69: annotations are not supported in -source 1.4
              [javac] (use -source 5 or higher to enable annotations)
              [javac]   @Override
              [javac]    ^
              [javac] /lucene/lucene.constant/src/java/org/apache/lucene/search/WildcardFilter.java:49: annotations are not supported in -source 1.4
              [javac] (use -source 5 or higher to enable annotations)
              [javac]   @Override
              [javac]    ^
              [javac] 2 errors
          

          I'll fix in my local checkout.

          Show
          Michael McCandless added a comment - Woops... patch uses @Override (Java 5 only): common.compile-core: [mkdir] Created dir: /lucene/lucene.constant/build/classes/java [javac] Compiling 333 source files to /lucene/lucene.constant/build/classes/java [javac] /lucene/lucene.constant/src/java/org/apache/lucene/search/ConstantScoreWildcardQuery.java:69: annotations are not supported in -source 1.4 [javac] (use -source 5 or higher to enable annotations) [javac] @Override [javac] ^ [javac] /lucene/lucene.constant/src/java/org/apache/lucene/search/WildcardFilter.java:49: annotations are not supported in -source 1.4 [javac] (use -source 5 or higher to enable annotations) [javac] @Override [javac] ^ [javac] 2 errors I'll fix in my local checkout.
          Hide
          Michael McCandless added a comment -

          What do people think about putting these classes in?

          +1

          I think in general we should more aggressively slurp useful things like this from Solr into Lucene.

          Is there any reason to keep the rewrite-to-BooleanQuery version of these (vs, deprecating them in favor of the ConstantScore variety)? Are the score differences caused by the rewrite-to-BooleanQuery implementations ever "useful"?

          Show
          Michael McCandless added a comment - What do people think about putting these classes in? +1 I think in general we should more aggressively slurp useful things like this from Solr into Lucene. Is there any reason to keep the rewrite-to-BooleanQuery version of these (vs, deprecating them in favor of the ConstantScore variety)? Are the score differences caused by the rewrite-to-BooleanQuery implementations ever "useful"?
          Hide
          Mark Miller added a comment -

          What do people think about putting these classes in? They have no max clause limitations, and reports have been that they can be much more efficient on large indexes. Solr has switched from using the boolean rewriting WildcardQuery,PrefixQuery to ConstantScore queries. We already put in the ConstantScoreRange query for just these reasons, so I think it makes since to add the rest of the family. An upside will be that I can add support for these queries to the Span highlighter. That will put our Highlighter in a position of being able to highlight pretty much every major query type, which I think is an important goal.

          • Mark
          Show
          Mark Miller added a comment - What do people think about putting these classes in? They have no max clause limitations, and reports have been that they can be much more efficient on large indexes. Solr has switched from using the boolean rewriting WildcardQuery,PrefixQuery to ConstantScore queries. We already put in the ConstantScoreRange query for just these reasons, so I think it makes since to add the rest of the family. An upside will be that I can add support for these queries to the Span highlighter. That will put our Highlighter in a position of being able to highlight pretty much every major query type, which I think is an important goal. Mark

            People

            • Assignee:
              Unassigned
              Reporter:
              Mark Miller
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development