Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 4.1
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New
    1. booleansetperf.txt
      28 kB
      John Wang
    2. DisjunctionDISI.java
      7 kB
      Paul Elschot
    3. DisjunctionDISI.patch
      18 kB
      Eks Dev
    4. DisjunctionDISI.patch
      18 kB
      Eks Dev
    5. LUCENE-1345.patch
      33 kB
      Paul Elschot
    6. LUCENE-1345.patch
      48 kB
      Paul Elschot
    7. LUCENE-1345.patch
      27 kB
      Paul Elschot
    8. LUCENE-1345-Filter+Query-merge.patch
      9 kB
      Uwe Schindler
    9. OpenBitSetIteratorExperiment.java
      6 kB
      Eks Dev
    10. TestIteratorPerf.java
      3 kB
      Eks Dev
    11. TestIteratorPerf.java
      3 kB
      Eks Dev

      Issue Links

        Activity

        Hide
        Paul Elschot added a comment -

        A first half attempt, it still leaves a few compile errors.

        This would allow faster Filter evaluation because the tight loop
        would be in ConjunctionScorer only.

        It would also allow to get rid of Filter in most of the search api,
        as any Filter can just be added to a BooleanQuery.

        Would anyone have a DisjunctionDISI (Disjunction over DocIdSetIterators) somewhere?

        Show
        Paul Elschot added a comment - A first half attempt, it still leaves a few compile errors. This would allow faster Filter evaluation because the tight loop would be in ConjunctionScorer only. It would also allow to get rid of Filter in most of the search api, as any Filter can just be added to a BooleanQuery. Would anyone have a DisjunctionDISI (Disjunction over DocIdSetIterators) somewhere?
        Hide
        Eks Dev added a comment -

        Would anyone have a DisjunctionDISI (Disjunction over DocIdSetIterators) somewhere?

        I have played with DisjunctionSumScorer rip-off, maybe you find it useful for this issue...

        What would be nice here(and in DisjunctionSumScorer ), if possible?:

        • to remove initDISIQueue() from next() and skipTo() (also the same in DisjunctionSumScorer()) ... this is due to this ugly -1 position before first call, I just do not know how to get rid of it
        • to switch to Conjuction "mode" if minNrShouldMatch kicks in.... there are already todo-s for it arround

        if you think you can use it, just go ahead and include it in your patch, I am not using this for anything, just wrapped it up when you asked.

        Show
        Eks Dev added a comment - Would anyone have a DisjunctionDISI (Disjunction over DocIdSetIterators) somewhere? I have played with DisjunctionSumScorer rip-off, maybe you find it useful for this issue... What would be nice here(and in DisjunctionSumScorer ), if possible?: to remove initDISIQueue() from next() and skipTo() (also the same in DisjunctionSumScorer()) ... this is due to this ugly -1 position before first call, I just do not know how to get rid of it to switch to Conjuction "mode" if minNrShouldMatch kicks in.... there are already todo-s for it arround if you think you can use it, just go ahead and include it in your patch, I am not using this for anything, just wrapped it up when you asked.
        Hide
        Paul Elschot added a comment -

        Thanks for the DisjunctionDISI.patch. I had just continued, but I hadn't come that far yet.
        I'll be off quite irregularly in the next month, I'll try and attach here when there's real progress.

        Show
        Paul Elschot added a comment - Thanks for the DisjunctionDISI.patch. I had just continued, but I hadn't come that far yet. I'll be off quite irregularly in the next month, I'll try and attach here when there's real progress.
        Hide
        Eks Dev added a comment -

        I just realised TestDisjunctionDISI had a bug (iterators have to be reinitialized)...

        apart from that only small change in DISIQueue to use constants instead of vars (compiler should have done it as well, but you never know)

        private final void downHeap() {
        + int i = 1;
        + int j = 2; //i << 1; // find smaller child
        + int k = 3; //j + 1;
        +

        Show
        Eks Dev added a comment - I just realised TestDisjunctionDISI had a bug (iterators have to be reinitialized)... apart from that only small change in DISIQueue to use constants instead of vars (compiler should have done it as well, but you never know) private final void downHeap() { + int i = 1; + int j = 2; //i << 1; // find smaller child + int k = 3; //j + 1; +
        Hide
        Eks Dev added a comment -

        Hi Paul,
        I gave it a try on micro benchmarking, and it looks like we could gain a lot by switcing to sentinel approach for iterators, apart for being faster they are also a bit robuster to "one off" bugs.

        This test is just a simulation made assuming docId is long (I have tried it with int and it is about the same result).

        Just attaching it here as I did not want to create new issue for now, before we identify if there are some design/performance knock-out criteria.

        test on my setup:
        32bit java version "1.6.0_10-rc"
        java(TM) SE Runtime Environment (build 1.6.0_10-rc-b28)
        Windows XP Profesional 32bit
        notebook, 3Gb RAM,
        CPU x86 Family 6 Model 15 Stepping 11 GenuineIntel ~2194 Mhz

        java -server -Xbatch

        result (with docID long):
        old milliseconds=6938
        old milliseconds=6953
        old milliseconds=6890
        old milliseconds=6938
        old milliseconds=6906
        old milliseconds=6922
        old milliseconds=6906
        old milliseconds=6938
        old milliseconds=6906
        old milliseconds=6906
        old total milliseconds=69203

        new milliseconds=5797
        new milliseconds=5703
        new milliseconds=5266
        new milliseconds=5250
        new milliseconds=5234
        new milliseconds=5250
        new milliseconds=5235
        new milliseconds=5250
        new milliseconds=5250
        new milliseconds=5250
        new total milliseconds=53485
        New/Old Time 53485/69203 (77.28711%)

        all in all, faster more than 22% !!

        Of course, this type of benchmark does not mean all iterator ops in real life are going to be 20% faster... other things probably dominate, but if it proves that this test does not have some flaws (easy possible)... well worth of pursuing

        cheers, eks

        Show
        Eks Dev added a comment - Hi Paul, I gave it a try on micro benchmarking, and it looks like we could gain a lot by switcing to sentinel approach for iterators, apart for being faster they are also a bit robuster to "one off" bugs. This test is just a simulation made assuming docId is long (I have tried it with int and it is about the same result). Just attaching it here as I did not want to create new issue for now, before we identify if there are some design/performance knock-out criteria. test on my setup: 32bit java version "1.6.0_10-rc" java(TM) SE Runtime Environment (build 1.6.0_10-rc-b28) Windows XP Profesional 32bit notebook, 3Gb RAM, CPU x86 Family 6 Model 15 Stepping 11 GenuineIntel ~2194 Mhz java -server -Xbatch result (with docID long): old milliseconds=6938 old milliseconds=6953 old milliseconds=6890 old milliseconds=6938 old milliseconds=6906 old milliseconds=6922 old milliseconds=6906 old milliseconds=6938 old milliseconds=6906 old milliseconds=6906 old total milliseconds=69203 new milliseconds=5797 new milliseconds=5703 new milliseconds=5266 new milliseconds=5250 new milliseconds=5234 new milliseconds=5250 new milliseconds=5235 new milliseconds=5250 new milliseconds=5250 new milliseconds=5250 new total milliseconds=53485 New/Old Time 53485/69203 (77.28711%) all in all, faster more than 22% !! Of course, this type of benchmark does not mean all iterator ops in real life are going to be 20% faster... other things probably dominate, but if it proves that this test does not have some flaws (easy possible)... well worth of pursuing cheers, eks
        Hide
        Paul Elschot added a comment -

        Patch of 20080729: all tests pass, but no tests cases for filter clauses yet.

        Added BooleanFilterClause class, usable as argument to BooleanQuery.add().

        API change: made ReqExclScorer package private, added an arg to the constructor.

        Removed the queueSize variable in DisjunctionSumScorer and in the added DisjunctionDISI. Left the doc caching in ScorerDocQueue and in the added DisiDocQueue.

        It might be possible to subclass DisjunctionSumScorer from DisjunctionDISI,
        and to subclass ScorerDocQueue from DisiDocQueue, I have not checked that.

        Since ConjunctionScorer can handle DocIdSetIterators with this patch, it should improve the speed for Filters when they are added to a BooleanQuery instead of being used as through the current search API.

        Eks, thanks for DisjunctionDISI, I took it a bit further.

        Show
        Paul Elschot added a comment - Patch of 20080729: all tests pass, but no tests cases for filter clauses yet. Added BooleanFilterClause class, usable as argument to BooleanQuery.add(). API change: made ReqExclScorer package private, added an arg to the constructor. Removed the queueSize variable in DisjunctionSumScorer and in the added DisjunctionDISI. Left the doc caching in ScorerDocQueue and in the added DisiDocQueue. It might be possible to subclass DisjunctionSumScorer from DisjunctionDISI, and to subclass ScorerDocQueue from DisiDocQueue, I have not checked that. Since ConjunctionScorer can handle DocIdSetIterators with this patch, it should improve the speed for Filters when they are added to a BooleanQuery instead of being used as through the current search API. Eks, thanks for DisjunctionDISI, I took it a bit further.
        Hide
        Eks Dev added a comment -

        great! Will look into at at the weekend in more datails.

        I have moved this part to Constructor on my local copy, it passes all tests:

        + if (disiDocQueue == null)

        { + initDisiDocQueue(); + }

        it is in next() and skipTo()....

        practically the same as reported in https://issues.apache.org/jira/browse/LUCENE-1145, with this, 1145 can be closed

        Show
        Eks Dev added a comment - great! Will look into at at the weekend in more datails. I have moved this part to Constructor on my local copy, it passes all tests: + if (disiDocQueue == null) { + initDisiDocQueue(); + } it is in next() and skipTo().... practically the same as reported in https://issues.apache.org/jira/browse/LUCENE-1145 , with this, 1145 can be closed
        Hide
        Paul Elschot added a comment -

        20090729 is the date here, the attachment is dated 20080728, never mind.

        As to the sentinel for doc()/next() in the TestIteraratorPerf patch: this will need some real Scorers/DocIdSetIterators to see actual JIT compiler inlining in both cases. In the patch, the Old and New classes are local private classes, which are much easier to inline than separate, (non final) public classes.

        Show
        Paul Elschot added a comment - 20090729 is the date here, the attachment is dated 20080728, never mind. As to the sentinel for doc()/next() in the TestIteraratorPerf patch: this will need some real Scorers/DocIdSetIterators to see actual JIT compiler inlining in both cases. In the patch, the Old and New classes are local private classes, which are much easier to inline than separate, (non final) public classes.
        Hide
        Paul Elschot added a comment -

        Indeed, it makes sense to add the changes from LUCENE-1145 here.
        I remembered some discussion about this, but not that there was an issue open...

        Show
        Paul Elschot added a comment - Indeed, it makes sense to add the changes from LUCENE-1145 here. I remembered some discussion about this, but not that there was an issue open...
        Hide
        Yonik Seeley added a comment -

        Eks: just for grins, you can sometimes save a single cycle by changing "id==-1" to "id<0" (many operations on x86 automatically set status flags, hence comparison to zero can often be free). Not sure if the java optimizer will catch it though, and if it does if it would actually rise above the noise level.

        Show
        Yonik Seeley added a comment - Eks: just for grins, you can sometimes save a single cycle by changing "id==-1" to "id<0" (many operations on x86 automatically set status flags, hence comparison to zero can often be free). Not sure if the java optimizer will catch it though, and if it does if it would actually rise above the noise level.
        Hide
        Eks Dev added a comment -

        Yonik, this would probably work fine for int values (on my CPU), I have tried it on long values and this was significantly slower on this test... it boils down again to "what is the CPU we are optimizing for"

        Show
        Eks Dev added a comment - Yonik, this would probably work fine for int values (on my CPU), I have tried it on long values and this was significantly slower on this test... it boils down again to "what is the CPU we are optimizing for"
        Hide
        Yonik Seeley added a comment -

        I have tried it on long values and this was significantly slower on this test.

        Huh... I bet it's the test. It's probably so simple that everything is inlined and the comparison with -1 is being optimized away entirely (since a compare instruction is the same speed... doesn't matter if one is checking for equality or for less).

        Show
        Yonik Seeley added a comment - I have tried it on long values and this was significantly slower on this test. Huh... I bet it's the test. It's probably so simple that everything is inlined and the comparison with -1 is being optimized away entirely (since a compare instruction is the same speed... doesn't matter if one is checking for equality or for less).
        Hide
        Eks Dev added a comment -

        comparison with -1 is being optimized away entirely

        I do not think so, how compiler could "optimize away" the only condition that stops the loop? The loop would never finish, or am I misreading something here?

        Anyhow, the test is so simple that compiler can take completely other direction from the real case. I guess much better test (without too much effort!) would be to take something like OpenBitSetIterator and make one Iterator implementation with sentinel approach and then compare... this test is really just a dumb loop, but on the other side isolates the difference between two approaches...

        Show
        Eks Dev added a comment - comparison with -1 is being optimized away entirely I do not think so, how compiler could "optimize away" the only condition that stops the loop? The loop would never finish, or am I misreading something here? Anyhow, the test is so simple that compiler can take completely other direction from the real case. I guess much better test (without too much effort!) would be to take something like OpenBitSetIterator and make one Iterator implementation with sentinel approach and then compare... this test is really just a dumb loop, but on the other side isolates the difference between two approaches...
        Hide
        Paul Elschot added a comment -

        DisjunctionDisi.java of 20080729 has a first try of switching to conjunction mode, see advanceAfterCurrent and requiredDisis in there.

        I have not even compiled it yet, just showing the basic idea, no more time at the moment.

        Show
        Paul Elschot added a comment - DisjunctionDisi.java of 20080729 has a first try of switching to conjunction mode, see advanceAfterCurrent and requiredDisis in there. I have not even compiled it yet, just showing the basic idea, no more time at the moment.
        Hide
        Eks Dev added a comment -

        I just enhanced TestIteratorPerf to work with OpenBitSetIterator(Experiment)... on dense bit sets sentinel based are faster (ca 9%), on low density about the same?

        Yonik's tip -1 < doc instead of -1 != doc still performs worse, and knowing Yonik's hunch on these things, I am still not convinced it is really faster ...

        Paul's work here is more interesting, clear API and Performance win on many fronts...

        practically, no need to pollute this issue more with iterator semantics if I(or someone else) figure out something really interesting there, will create new issue ....

        Show
        Eks Dev added a comment - I just enhanced TestIteratorPerf to work with OpenBitSetIterator(Experiment)... on dense bit sets sentinel based are faster (ca 9%), on low density about the same? Yonik's tip -1 < doc instead of -1 != doc still performs worse, and knowing Yonik's hunch on these things, I am still not convinced it is really faster ... Paul's work here is more interesting, clear API and Performance win on many fronts... practically, no need to pollute this issue more with iterator semantics if I(or someone else) figure out something really interesting there, will create new issue ....
        Hide
        Yonik Seeley added a comment -

        Eks, I just tried your first TestIteratorPerf.java myself, and comparison with zero was faster (as expected) by about 8%
        I commented everything out except for testNew for simplicity.

        Original testNew:

        $ java -server -cp . TestIteratorPerf
        new  milliseconds=2883
        new  milliseconds=3289
        new  milliseconds=3148
        new  milliseconds=3195
        new  milliseconds=3149
        new  milliseconds=3179
        new  milliseconds=3180
        new  milliseconds=3164
        new  milliseconds=3179
        new  milliseconds=3164
        new total milliseconds=31530
        

        Modified testNew:
        // while(-1!=(doc=it.next())){
        while((doc=it.next()) >= 0)

        $ java -server -cp . TestIteratorPerf
        new  milliseconds=2806
        new  milliseconds=2899
        new  milliseconds=2915
        new  milliseconds=2899
        new  milliseconds=2914
        new  milliseconds=2899
        new  milliseconds=2899
        new  milliseconds=3040
        new  milliseconds=2899
        new  milliseconds=2930
        new total milliseconds=29100
        

        System: WinXP, Pentium4, java version "1.5.0_11"

        Show
        Yonik Seeley added a comment - Eks, I just tried your first TestIteratorPerf.java myself, and comparison with zero was faster (as expected) by about 8% I commented everything out except for testNew for simplicity. Original testNew: $ java -server -cp . TestIteratorPerf new milliseconds=2883 new milliseconds=3289 new milliseconds=3148 new milliseconds=3195 new milliseconds=3149 new milliseconds=3179 new milliseconds=3180 new milliseconds=3164 new milliseconds=3179 new milliseconds=3164 new total milliseconds=31530 Modified testNew: // while(-1!=(doc=it.next())){ while((doc=it.next()) >= 0) $ java -server -cp . TestIteratorPerf new milliseconds=2806 new milliseconds=2899 new milliseconds=2915 new milliseconds=2899 new milliseconds=2914 new milliseconds=2899 new milliseconds=2899 new milliseconds=3040 new milliseconds=2899 new milliseconds=2930 new total milliseconds=29100 System: WinXP, Pentium4, java version "1.5.0_11"
        Hide
        John Wang added a comment -

        Added perf comparisons with boolean set iterators with current scorers

        See patch

        System: Ubunto,
        java version "1.6.0_11"
        Intel core2 Duo 2.44ghz

        new milliseconds=470
        new milliseconds=534
        new milliseconds=450
        new milliseconds=443
        new milliseconds=444
        new milliseconds=445
        new milliseconds=449
        new milliseconds=441
        new milliseconds=444
        new milliseconds=445
        new total milliseconds=4565
        old milliseconds=529
        old milliseconds=491
        old milliseconds=428
        old milliseconds=549
        old milliseconds=427
        old milliseconds=424
        old milliseconds=420
        old milliseconds=424
        old milliseconds=423
        old milliseconds=422
        old total milliseconds=4537

        New/Old Time 4565/4537 (100.61715%)
        OrDocIdSetIterator milliseconds=1138
        OrDocIdSetIterator milliseconds=1106
        OrDocIdSetIterator milliseconds=1065
        OrDocIdSetIterator milliseconds=1066
        OrDocIdSetIterator milliseconds=1065
        OrDocIdSetIterator milliseconds=1067
        OrDocIdSetIterator milliseconds=1072
        OrDocIdSetIterator milliseconds=1118
        OrDocIdSetIterator milliseconds=1065
        OrDocIdSetIterator milliseconds=1069
        OrDocIdSetIterator total milliseconds=10831
        DisjunctionMaxScorer milliseconds=1914
        DisjunctionMaxScorer milliseconds=1981
        DisjunctionMaxScorer milliseconds=1861
        DisjunctionMaxScorer milliseconds=1893
        DisjunctionMaxScorer milliseconds=1886
        DisjunctionMaxScorer milliseconds=1885
        DisjunctionMaxScorer milliseconds=1887
        DisjunctionMaxScorer milliseconds=1889
        DisjunctionMaxScorer milliseconds=1891
        DisjunctionMaxScorer milliseconds=1888
        DisjunctionMaxScorer total milliseconds=18975
        Or/DisjunctionMax Time 10831/18975 (57.080368%)
        OrDocIdSetIterator milliseconds=1079
        OrDocIdSetIterator milliseconds=1075
        OrDocIdSetIterator milliseconds=1076
        OrDocIdSetIterator milliseconds=1093
        OrDocIdSetIterator milliseconds=1077
        OrDocIdSetIterator milliseconds=1074
        OrDocIdSetIterator milliseconds=1078
        OrDocIdSetIterator milliseconds=1075
        OrDocIdSetIterator milliseconds=1074
        OrDocIdSetIterator milliseconds=1074
        OrDocIdSetIterator total milliseconds=10775
        DisjunctionSumScorer milliseconds=1398
        DisjunctionSumScorer milliseconds=1322
        DisjunctionSumScorer milliseconds=1320
        DisjunctionSumScorer milliseconds=1305
        DisjunctionSumScorer milliseconds=1304
        DisjunctionSumScorer milliseconds=1301
        DisjunctionSumScorer milliseconds=1304
        DisjunctionSumScorer milliseconds=1300
        DisjunctionSumScorer milliseconds=1301
        DisjunctionSumScorer milliseconds=1317
        DisjunctionSumScorer total milliseconds=13172
        Or/DisjunctionSum Time 10775/13172 (81.80231%)
        AndDocIdSetIterator milliseconds=330
        AndDocIdSetIterator milliseconds=336
        AndDocIdSetIterator milliseconds=298
        AndDocIdSetIterator milliseconds=299
        AndDocIdSetIterator milliseconds=310
        AndDocIdSetIterator milliseconds=298
        AndDocIdSetIterator milliseconds=298
        AndDocIdSetIterator milliseconds=334
        AndDocIdSetIterator milliseconds=298
        AndDocIdSetIterator milliseconds=299
        AndDocIdSetIterator total milliseconds=3100
        ConjunctionScorer milliseconds=332
        ConjunctionScorer milliseconds=307
        ConjunctionScorer milliseconds=302
        ConjunctionScorer milliseconds=350
        ConjunctionScorer milliseconds=300
        ConjunctionScorer milliseconds=304
        ConjunctionScorer milliseconds=305
        ConjunctionScorer milliseconds=303
        ConjunctionScorer milliseconds=303
        ConjunctionScorer milliseconds=299
        ConjunctionScorer total milliseconds=3105
        And/Conjunction Time 3100/3105 (99.83897%)

        Show
        John Wang added a comment - Added perf comparisons with boolean set iterators with current scorers See patch System: Ubunto, java version "1.6.0_11" Intel core2 Duo 2.44ghz new milliseconds=470 new milliseconds=534 new milliseconds=450 new milliseconds=443 new milliseconds=444 new milliseconds=445 new milliseconds=449 new milliseconds=441 new milliseconds=444 new milliseconds=445 new total milliseconds=4565 old milliseconds=529 old milliseconds=491 old milliseconds=428 old milliseconds=549 old milliseconds=427 old milliseconds=424 old milliseconds=420 old milliseconds=424 old milliseconds=423 old milliseconds=422 old total milliseconds=4537 New/Old Time 4565/4537 (100.61715%) OrDocIdSetIterator milliseconds=1138 OrDocIdSetIterator milliseconds=1106 OrDocIdSetIterator milliseconds=1065 OrDocIdSetIterator milliseconds=1066 OrDocIdSetIterator milliseconds=1065 OrDocIdSetIterator milliseconds=1067 OrDocIdSetIterator milliseconds=1072 OrDocIdSetIterator milliseconds=1118 OrDocIdSetIterator milliseconds=1065 OrDocIdSetIterator milliseconds=1069 OrDocIdSetIterator total milliseconds=10831 DisjunctionMaxScorer milliseconds=1914 DisjunctionMaxScorer milliseconds=1981 DisjunctionMaxScorer milliseconds=1861 DisjunctionMaxScorer milliseconds=1893 DisjunctionMaxScorer milliseconds=1886 DisjunctionMaxScorer milliseconds=1885 DisjunctionMaxScorer milliseconds=1887 DisjunctionMaxScorer milliseconds=1889 DisjunctionMaxScorer milliseconds=1891 DisjunctionMaxScorer milliseconds=1888 DisjunctionMaxScorer total milliseconds=18975 Or/DisjunctionMax Time 10831/18975 (57.080368%) OrDocIdSetIterator milliseconds=1079 OrDocIdSetIterator milliseconds=1075 OrDocIdSetIterator milliseconds=1076 OrDocIdSetIterator milliseconds=1093 OrDocIdSetIterator milliseconds=1077 OrDocIdSetIterator milliseconds=1074 OrDocIdSetIterator milliseconds=1078 OrDocIdSetIterator milliseconds=1075 OrDocIdSetIterator milliseconds=1074 OrDocIdSetIterator milliseconds=1074 OrDocIdSetIterator total milliseconds=10775 DisjunctionSumScorer milliseconds=1398 DisjunctionSumScorer milliseconds=1322 DisjunctionSumScorer milliseconds=1320 DisjunctionSumScorer milliseconds=1305 DisjunctionSumScorer milliseconds=1304 DisjunctionSumScorer milliseconds=1301 DisjunctionSumScorer milliseconds=1304 DisjunctionSumScorer milliseconds=1300 DisjunctionSumScorer milliseconds=1301 DisjunctionSumScorer milliseconds=1317 DisjunctionSumScorer total milliseconds=13172 Or/DisjunctionSum Time 10775/13172 (81.80231%) AndDocIdSetIterator milliseconds=330 AndDocIdSetIterator milliseconds=336 AndDocIdSetIterator milliseconds=298 AndDocIdSetIterator milliseconds=299 AndDocIdSetIterator milliseconds=310 AndDocIdSetIterator milliseconds=298 AndDocIdSetIterator milliseconds=298 AndDocIdSetIterator milliseconds=334 AndDocIdSetIterator milliseconds=298 AndDocIdSetIterator milliseconds=299 AndDocIdSetIterator total milliseconds=3100 ConjunctionScorer milliseconds=332 ConjunctionScorer milliseconds=307 ConjunctionScorer milliseconds=302 ConjunctionScorer milliseconds=350 ConjunctionScorer milliseconds=300 ConjunctionScorer milliseconds=304 ConjunctionScorer milliseconds=305 ConjunctionScorer milliseconds=303 ConjunctionScorer milliseconds=303 ConjunctionScorer milliseconds=299 ConjunctionScorer total milliseconds=3105 And/Conjunction Time 3100/3105 (99.83897%)
        Hide
        John Wang added a comment -

        Added And/Or/Not DocidSet/Iterators

        code ported over from Kamikaze:
        http://code.google.com/p/lucene-ext/

        Perf test updated.

        main contributors to the patch: Anmol Bhasin & Yasuhiro Matsuda

        Show
        John Wang added a comment - Added And/Or/Not DocidSet/Iterators code ported over from Kamikaze: http://code.google.com/p/lucene-ext/ Perf test updated. main contributors to the patch: Anmol Bhasin & Yasuhiro Matsuda
        Hide
        John Wang added a comment -

        Given the perf number improvements we see, can we consider up the priority?

        Show
        John Wang added a comment - Given the perf number improvements we see, can we consider up the priority?
        Hide
        Michael McCandless added a comment -

        > Given the perf number improvements we see, can we consider up the priority?

        I agree, the results look compelling; I marked this and LUCENE-1145 as fix version 2.9.

        Show
        Michael McCandless added a comment - > Given the perf number improvements we see, can we consider up the priority? I agree, the results look compelling; I marked this and LUCENE-1145 as fix version 2.9.
        Hide
        Marvin Humphrey added a comment -

        Paul Elschot, a while back:

        > It would also allow to get rid of Filter in most of the search api,
        > as any Filter can just be added to a BooleanQuery.

        In KS svn trunk (and potentially in Lucy), there is no "Filter"; all classes
        that perform filtering are just subclasses of Query which you're expected to
        apply using an ANDQuery. Can you think of any downside to that model? (Would
        it be possible to retrofit Lucene to use it in 3.0?) The motivation was the
        same as the one you articulate: to simplify the search API.

        (Hmm...Thinking out loud: DeletionsFilter as a subclass of Query...)

        Show
        Marvin Humphrey added a comment - Paul Elschot, a while back: > It would also allow to get rid of Filter in most of the search api, > as any Filter can just be added to a BooleanQuery. In KS svn trunk (and potentially in Lucy), there is no "Filter"; all classes that perform filtering are just subclasses of Query which you're expected to apply using an ANDQuery. Can you think of any downside to that model? (Would it be possible to retrofit Lucene to use it in 3.0?) The motivation was the same as the one you articulate: to simplify the search API. (Hmm...Thinking out loud: DeletionsFilter as a subclass of Query...)
        Hide
        Michael McCandless added a comment -

        > (Hmm...Thinking out loud: DeletionsFilter as a subclass of Query...)

        +1

        Show
        Michael McCandless added a comment - > (Hmm...Thinking out loud: DeletionsFilter as a subclass of Query...) +1
        Hide
        John Wang added a comment -

        Filters by definition (afaik) does not participate in scoring. Since "score gathering" is done at the BooleanQuery level, does it mean BooleanQuery would need to do instanceof check to see if it is a Filter?

        Or do we always hardcode filter with score 0? This is also dangerous if people do augment scores at hitcollector level or score gathering logic changes to something not as straightforward as summing.

        my two cents.

        Show
        John Wang added a comment - Filters by definition (afaik) does not participate in scoring. Since "score gathering" is done at the BooleanQuery level, does it mean BooleanQuery would need to do instanceof check to see if it is a Filter? Or do we always hardcode filter with score 0? This is also dangerous if people do augment scores at hitcollector level or score gathering logic changes to something not as straightforward as summing. my two cents.
        Hide
        Paul Elschot added a comment -

        Marvin,

        In KS svn trunk (and potentially in Lucy), there is no "Filter"; all classes that perform filtering are just subclasses of Query which you're expected to apply using an ANDQuery. Can you think of any downside to that model?

        In Lucene the class model is that Scorer extends DocIdSetIterator by some
        methods involved with document score values. To prepare searching in
        Lucene the following 'transformations' are done:
        Query -> Weight -> Scorer
        and
        Filter -> DocIdSetIterator

        I've never seen the KS classes, but on the face of it, the downside of using
        ANDQuery (KS) for filtering is that it has to provide a score value, which
        somehow must be ignored during search.

        Show
        Paul Elschot added a comment - Marvin, In KS svn trunk (and potentially in Lucy), there is no "Filter"; all classes that perform filtering are just subclasses of Query which you're expected to apply using an ANDQuery. Can you think of any downside to that model? In Lucene the class model is that Scorer extends DocIdSetIterator by some methods involved with document score values. To prepare searching in Lucene the following 'transformations' are done: Query -> Weight -> Scorer and Filter -> DocIdSetIterator I've never seen the KS classes, but on the face of it, the downside of using ANDQuery (KS) for filtering is that it has to provide a score value, which somehow must be ignored during search.
        Hide
        Paul Elschot added a comment - - edited

        John, Michael,

        Given the perf number improvements we see, can we consider up the priority?

        I think most of the performance improvements that John posted can be moved into
        trunk without the addition of Filter as a clause to BooleanQuery. Therefore I'd rather let
        these go in before adding Filter as clause to BooleanQuery.

        Show
        Paul Elschot added a comment - - edited John, Michael, Given the perf number improvements we see, can we consider up the priority? I think most of the performance improvements that John posted can be moved into trunk without the addition of Filter as a clause to BooleanQuery. Therefore I'd rather let these go in before adding Filter as clause to BooleanQuery.
        Hide
        Uwe Schindler added a comment -

        Here is a nice idea, how to merge Filters and Queries:

        Why not just combine ConstantScoreQuery and the current abstract Filter APIs to a new Filter class. This would make it possible, to use every filter as a query. The new abstract filter class would contain all methods of ConstantScoreQuery and it would even be backwards compatible. If somebody implements the filters getDocIdSet()/bits() methods he has nothing more to do, he could just use the filter as a normal query.

        For some performance improvements when combining more than one filter in a BooleanQuery (e.g. anding/oring the iterators, filtering,...) the code of BooleanQuery could use instanceof.

        Show
        Uwe Schindler added a comment - Here is a nice idea, how to merge Filters and Queries: Why not just combine ConstantScoreQuery and the current abstract Filter APIs to a new Filter class. This would make it possible, to use every filter as a query. The new abstract filter class would contain all methods of ConstantScoreQuery and it would even be backwards compatible. If somebody implements the filters getDocIdSet()/bits() methods he has nothing more to do, he could just use the filter as a normal query. For some performance improvements when combining more than one filter in a BooleanQuery (e.g. anding/oring the iterators, filtering,...) the code of BooleanQuery could use instanceof.
        Hide
        Uwe Schindler added a comment -

        Attached is a patch, that merges ConstantScoreQuery and Filter APIs. The ConstantScoreQuery is deprecated by this. Further cleanups then may remove the use of ConstantScoreQuery from other places in Lucene, where the class is currently used to wrap filters as Queries.

        All important tests pass! Only one test does not pass: a problem occurs in TestSimpleExplanations, but this may be because the changed toString()/toString(field) and query explanations (because ConstantScoreQuery no longer returns an explanation), this may be cha nged or the test should be rewritten. Nevertheless, I do not understand the whole test case

        Uwe

        Show
        Uwe Schindler added a comment - Attached is a patch, that merges ConstantScoreQuery and Filter APIs. The ConstantScoreQuery is deprecated by this. Further cleanups then may remove the use of ConstantScoreQuery from other places in Lucene, where the class is currently used to wrap filters as Queries. All important tests pass! Only one test does not pass: a problem occurs in TestSimpleExplanations, but this may be because the changed toString()/toString(field) and query explanations (because ConstantScoreQuery no longer returns an explanation), this may be cha nged or the test should be rewritten. Nevertheless, I do not understand the whole test case Uwe
        Hide
        Paul Elschot added a comment -

        Uwe,

        The point here is to let BooleanQuery also take care of the filtering logic without doing any extra score computations.

        For example that involves changing ConjunctionScorer to not only accept Scorers, but also DocIdSetIterators,
        and use these DocIdSetIterators together with the Scorers to skip to the next matching document, but only use the Scorers to compute the score value.

        What is the point of adding a score value to Filters, when that score value has to be ignored during query search?

        Show
        Paul Elschot added a comment - Uwe, The point here is to let BooleanQuery also take care of the filtering logic without doing any extra score computations. For example that involves changing ConjunctionScorer to not only accept Scorers, but also DocIdSetIterators, and use these DocIdSetIterators together with the Scorers to skip to the next matching document, but only use the Scorers to compute the score value. What is the point of adding a score value to Filters, when that score value has to be ignored during query search?
        Hide
        Uwe Schindler added a comment -

        The idea behind the patch was to merge the code of filters and queries. Further optimizations now can remove the score calculation from the filter code.

        Using my patch you are now be able to add filters to BooleanQueries or directly execute them using Searcher.search, because they are subclasses of Query. Further optimizations now may remove the score computation in complete, if the given query extends Filter (if (query instanceof Filter) do something other).

        Show
        Uwe Schindler added a comment - The idea behind the patch was to merge the code of filters and queries. Further optimizations now can remove the score calculation from the filter code. Using my patch you are now be able to add filters to BooleanQueries or directly execute them using Searcher.search, because they are subclasses of Query. Further optimizations now may remove the score computation in complete, if the given query extends Filter (if (query instanceof Filter) do something other).
        Hide
        Paul Elschot added a comment -

        Further optimizations now may remove the score computation in complete, if the given query extends Filter (if (query instanceof Filter) do something other)

        Such further optimization is precisily the idea of the original patch here, but without making Filter a subclass of Query.

        Show
        Paul Elschot added a comment - Further optimizations now may remove the score computation in complete, if the given query extends Filter (if (query instanceof Filter) do something other) Such further optimization is precisily the idea of the original patch here, but without making Filter a subclass of Query.
        Hide
        Uwe Schindler added a comment -

        I know this. My idea was just to remove the burden of thinking about Filters and Queries for the developer of Lucene applications.

        In my opinion, the terms "Query" and "Filter" should be merged. Logic behind BooleanQuery or Searcher should simply think about the *best" logic how to optimize what the user wants to do.

        Maybe I should create an new JIRA issue out of my suggestion to merge Filters and Queries? In my opinion, this is something nice to have in 3.0.

        Show
        Uwe Schindler added a comment - I know this. My idea was just to remove the burden of thinking about Filters and Queries for the developer of Lucene applications. In my opinion, the terms "Query" and "Filter" should be merged. Logic behind BooleanQuery or Searcher should simply think about the *best" logic how to optimize what the user wants to do. Maybe I should create an new JIRA issue out of my suggestion to merge Filters and Queries? In my opinion, this is something nice to have in 3.0.
        Hide
        Paul Elschot added a comment -

        In my opinion, the terms "Query" and "Filter" should be merged.

        There is clear distinction between the two terms.
        QueryWrapperFilter changes a Query into a Filter and ConstantScoreQuery changes a Filter into a Query.
        The first one removes the scoring by upcasting a Scorer to a DocIdSetIterator, and the second one adds a constant score to a DocIdSetIterator to create a Scorer.

        Show
        Paul Elschot added a comment - In my opinion, the terms "Query" and "Filter" should be merged. There is clear distinction between the two terms. QueryWrapperFilter changes a Query into a Filter and ConstantScoreQuery changes a Filter into a Query. The first one removes the scoring by upcasting a Scorer to a DocIdSetIterator, and the second one adds a constant score to a DocIdSetIterator to create a Scorer.
        Hide
        Uwe Schindler added a comment -

        In my opinion, the terms "Query" and "Filter" should be merged.

        There is clear distinction between the two terms.
        QueryWrapperFilter changes a Query into a Filter and ConstantScoreQuery changes a Filter into a Query.
        The first one removes the scoring by upcasting a Scorer to a DocIdSetIterator, and the second one adds a constant score to a DocIdSetIterator to create a Scorer.

        You are right, but for a Lucene user there is always the problem of the distiction between both terms. When combining both, the user would get less burden on thinking about both. It would make life easier, and would hide some work for the user. The problem are the fine differences between the both, but for the general user who does not have such large indexes where the difference between both counts, it would makte things easier.

        How about merging Filters and Queries and then thinking about optimizations in the code of BooleanQuery to identify use cases where the scoring can be removed and where a constant score is needed. There are two cases where the two different types make problems:

        • user (A) wants to use my contrib TrieRangeQuery/-Filter and just execute a Query that returns documents that match the Range. The problem for this user is: How to implement this? User a MatchAllDocsQuery and filter the results with TrieRangeFilter or use ConstantScoreQuery to combine both? What is faster?
        • user (B) wants to filter some documents using a normal Filter. If he uses the standard Query+Filter combination of Searcher.search() he must before distinguish what part of the combinations should be the filter and what should be the query. Maybe he got a TrieRangeQuery (the query one using a ConstantScore on the Filter) as query and want it combine with another query. With the new code that detects the type of both clauses, BooleanQuery code could choose to execute the TermQueries as normal scorer query and filter the results using the given Filter as clause.

        Both tasks could be easily combined if Query and Filter would be the same. The user (A) would not need to create a constant score query on the Trie filter, he could just use it with Searcher.search() as a "Query". If he want to add some normal term queries from a query parser to it, he would use a BooleanQuery to combine both. The BooleanQuery code would then find out that one of the clauses is a Filter and would not use ConstantScore code to filter the result and just use the normal filter code. For the user it is simplier: He would always create a TrieRangeQueryFilter combination and would let BooleanQuery choose what query execution strategy to use.

        Show
        Uwe Schindler added a comment - In my opinion, the terms "Query" and "Filter" should be merged. There is clear distinction between the two terms. QueryWrapperFilter changes a Query into a Filter and ConstantScoreQuery changes a Filter into a Query. The first one removes the scoring by upcasting a Scorer to a DocIdSetIterator, and the second one adds a constant score to a DocIdSetIterator to create a Scorer. You are right, but for a Lucene user there is always the problem of the distiction between both terms. When combining both, the user would get less burden on thinking about both. It would make life easier, and would hide some work for the user. The problem are the fine differences between the both, but for the general user who does not have such large indexes where the difference between both counts, it would makte things easier. How about merging Filters and Queries and then thinking about optimizations in the code of BooleanQuery to identify use cases where the scoring can be removed and where a constant score is needed. There are two cases where the two different types make problems: user (A) wants to use my contrib TrieRangeQuery/-Filter and just execute a Query that returns documents that match the Range. The problem for this user is: How to implement this? User a MatchAllDocsQuery and filter the results with TrieRangeFilter or use ConstantScoreQuery to combine both? What is faster? user (B) wants to filter some documents using a normal Filter. If he uses the standard Query+Filter combination of Searcher.search() he must before distinguish what part of the combinations should be the filter and what should be the query. Maybe he got a TrieRangeQuery (the query one using a ConstantScore on the Filter) as query and want it combine with another query. With the new code that detects the type of both clauses, BooleanQuery code could choose to execute the TermQueries as normal scorer query and filter the results using the given Filter as clause. Both tasks could be easily combined if Query and Filter would be the same. The user (A) would not need to create a constant score query on the Trie filter, he could just use it with Searcher.search() as a "Query". If he want to add some normal term queries from a query parser to it, he would use a BooleanQuery to combine both. The BooleanQuery code would then find out that one of the clauses is a Filter and would not use ConstantScore code to filter the result and just use the normal filter code. For the user it is simplier: He would always create a TrieRangeQueryFilter combination and would let BooleanQuery choose what query execution strategy to use.
        Hide
        Uwe Schindler added a comment -

        An additional case: User (A) uses a BooleanQuery and just adds the Filter to it and nothing more (no TermQueries and so on). In this case, ConstantScore algorithm must be used! But for the end user the API is always identical.

        Show
        Uwe Schindler added a comment - An additional case: User (A) uses a BooleanQuery and just adds the Filter to it and nothing more (no TermQueries and so on). In this case, ConstantScore algorithm must be used! But for the end user the API is always identical.
        Hide
        Uwe Schindler added a comment -

        Here some ideas how to implement search() with Query and Filter:

        • User runs Searcher.search() using a Filter as the only parameter. As every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents.
        • User runs Searcher.search() using a Query as the only parameter: No change, all is the same as before
        • User runs Searcher.search() using a BooleanQuery as parameter: If the BooleanQuery does not contain a Query that is subclass of Filter (the new Filter) everything as usual. If the BooleanQuery only contains exactly one Filter and nothing else the Filter is used as a constant score query. If BooleanQuery contains clauses with Queries and Filters the new algorithm could be used: The queries are executed and the results filtered with the filters.

        I hope this explains how I would implement the combined Filters and Queries.

        Show
        Uwe Schindler added a comment - Here some ideas how to implement search() with Query and Filter: User runs Searcher.search() using a Filter as the only parameter. As every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents. User runs Searcher.search() using a Query as the only parameter: No change, all is the same as before User runs Searcher.search() using a BooleanQuery as parameter: If the BooleanQuery does not contain a Query that is subclass of Filter (the new Filter) everything as usual. If the BooleanQuery only contains exactly one Filter and nothing else the Filter is used as a constant score query. If BooleanQuery contains clauses with Queries and Filters the new algorithm could be used: The queries are executed and the results filtered with the filters. I hope this explains how I would implement the combined Filters and Queries.
        Hide
        Paul Elschot added a comment -

        In case there is no score value for each matching document it is not possible to order the results to be presented to a user. Because of that I don't want BooleanQuery to run correctly without at least one normal query clause to provide a score to order the results. Putting this in another way: TopDocs (and the meanwhile deprecated Hits) make no sense without at least one normal query clause.

        There are cases when all results are needed, and in that case for a Query one normally uses to the HitCollector API. For Filters one could provide a MatchCollector API that collects all hits providing only the document numbers, and no score (as in HitCollector). This was part of the earlier versions of the patches at LUCENE-584 that introduced the new Filter API, but it was dropped because it was new functionality that was not really needed at the time.
        The same MatchCollector API can also be provided for Query searching. At that point, there is no difference between Query and Filter.

        Show
        Paul Elschot added a comment - In case there is no score value for each matching document it is not possible to order the results to be presented to a user. Because of that I don't want BooleanQuery to run correctly without at least one normal query clause to provide a score to order the results. Putting this in another way: TopDocs (and the meanwhile deprecated Hits) make no sense without at least one normal query clause. There are cases when all results are needed, and in that case for a Query one normally uses to the HitCollector API. For Filters one could provide a MatchCollector API that collects all hits providing only the document numbers, and no score (as in HitCollector). This was part of the earlier versions of the patches at LUCENE-584 that introduced the new Filter API, but it was dropped because it was new functionality that was not really needed at the time. The same MatchCollector API can also be provided for Query searching. At that point, there is no difference between Query and Filter.
        Hide
        Uwe Schindler added a comment -

        But where is the problem then: You only mean that BooleanQueries with only a Filter clause are not sortable. In my opinion this is not a real problem: I do for example use ConstantScoreQuery with a TrieRangeFilter as the only query constraint. All results are returned with score=1, this is similar to a classic SQL database. The default order of equal scores is index order, so it makes sense for TopDocs. And if you additionally add a SortField it makes more sense

        So why not implement the three cases as described in my last message and use a ConstantScoreQuery, if the Filter is alone in the BooleanQuery. In this case all options can be used with a simple API with no difference between Filters and Queries? The MatchCollector API in my opinion is also not needed. In the case of constant score (=1.0) why not simple call collect(dociId, 1.0f)? Why do we need a new API just because the score is not needed. Just define 1.0 as the score (like ConstantScoreQuery does).

        Show
        Uwe Schindler added a comment - But where is the problem then: You only mean that BooleanQueries with only a Filter clause are not sortable. In my opinion this is not a real problem: I do for example use ConstantScoreQuery with a TrieRangeFilter as the only query constraint. All results are returned with score=1, this is similar to a classic SQL database. The default order of equal scores is index order, so it makes sense for TopDocs. And if you additionally add a SortField it makes more sense So why not implement the three cases as described in my last message and use a ConstantScoreQuery, if the Filter is alone in the BooleanQuery. In this case all options can be used with a simple API with no difference between Filters and Queries? The MatchCollector API in my opinion is also not needed. In the case of constant score (=1.0) why not simple call collect(dociId, 1.0f)? Why do we need a new API just because the score is not needed. Just define 1.0 as the score (like ConstantScoreQuery does).
        Hide
        Marvin Humphrey added a comment -

        Paul Elschot:

        > To prepare searching in Lucene the following 'transformations' are done:
        >
        > Query -> Weight -> Scorer
        > and
        > Filter -> DocIdSetIterator
        >
        > I've never seen the KS classes

        KS's Search classes used to be pretty direct ports from the Lucene Search
        hierarchy – because when I was doing the work I had so much trouble grokking
        it that I felt there was no choice but to cargo cult.

        Since then, many changes have been made. Here are some that are germane to
        this discussion:

        • Filter has been eliminated, and filtering subclasses have been made into
          subclasses of Query.
        • Weight has been made a subclass of Query and renamed to "Compiler".
        • BooleanQuery has been replaced by the foursome of ANDQuery, ORQuery,
          NOTQuery, and RequiredOptionalQuery, all of which descend from the common
          parent PolyQuery, and which compile to scorers roughly akin to those from
          Lucene's BooleanScorer2 (i.e. they implement Scorer.skipTo())..

        By making both Filter and Weight/Compiler subclasses of Query, it became
        possible to simplify the Searchable/Searcher interface considerably.

        > the downside of using ANDQuery (KS) for filtering is that it has to provide
        > a score value, which somehow must be ignored during search.

        Good point, thanks!

        Right now, QueryFilter, NOTQuery, MatchAllQuery, and such all
        just provide a score of 0.0. ANDScorer adds the scores of its subscorers
        together, so there's no direct effect on the final score. However, the
        Similarity.coord() bonus is affected because the number of clauses has
        increased. That might be considered a bug.

        Beyond that, there's Scorer-compile-time optimization work to do along the
        lines of what Uwe proposes.

        Show
        Marvin Humphrey added a comment - Paul Elschot: > To prepare searching in Lucene the following 'transformations' are done: > > Query -> Weight -> Scorer > and > Filter -> DocIdSetIterator > > I've never seen the KS classes KS's Search classes used to be pretty direct ports from the Lucene Search hierarchy – because when I was doing the work I had so much trouble grokking it that I felt there was no choice but to cargo cult. Since then, many changes have been made. Here are some that are germane to this discussion: Filter has been eliminated, and filtering subclasses have been made into subclasses of Query. Weight has been made a subclass of Query and renamed to "Compiler". BooleanQuery has been replaced by the foursome of ANDQuery, ORQuery, NOTQuery, and RequiredOptionalQuery, all of which descend from the common parent PolyQuery, and which compile to scorers roughly akin to those from Lucene's BooleanScorer2 (i.e. they implement Scorer.skipTo()).. By making both Filter and Weight/Compiler subclasses of Query, it became possible to simplify the Searchable/Searcher interface considerably. > the downside of using ANDQuery (KS) for filtering is that it has to provide > a score value, which somehow must be ignored during search. Good point, thanks! Right now, QueryFilter, NOTQuery, MatchAllQuery, and such all just provide a score of 0.0. ANDScorer adds the scores of its subscorers together, so there's no direct effect on the final score. However, the Similarity.coord() bonus is affected because the number of clauses has increased. That might be considered a bug. Beyond that, there's Scorer-compile-time optimization work to do along the lines of what Uwe proposes.
        Hide
        Uwe Schindler added a comment -

        Paul Elschot:

        Just for clarification: I do not want to completely convert Filters to ConstantScoreQueries. The idea was to combine Queries and Filters in such a way, that every Filter can automatically be used at all places where a Query can be used (e.g. also alone a search query without any other constraint). For that, the abstract Query methods must be implemented and return a "default" weight for Filters which is the current ConstantScore Logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part could be directly used, the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries the idea is, to optimize the BooleanQuery logic in such a way, that it detects if a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API and not take the burden of the ConstantScoreQuery. There is only one special case: If the Filter is used alone in the BooleanQuery, then it must be executed as a ConstantScoreQuery, but only in this case. The problems with sorting are in my opinion not relevant: If score is identical (e.g. 1.0f) the results come in index order (this is how it appears to me). In this case TopDocs first list the docs with lower docIds.

        For the user this has the main advantage: That he can construct his query using a simplified API without thinking about Filters oder Queries, you can just combine clauses together. The scorer/weight logic then identifies the cases to use the filter or the query weight API. Just like the query optimizer of a RDB

        Is this clear now?

        Show
        Uwe Schindler added a comment - Paul Elschot: Just for clarification: I do not want to completely convert Filters to ConstantScoreQueries. The idea was to combine Queries and Filters in such a way, that every Filter can automatically be used at all places where a Query can be used (e.g. also alone a search query without any other constraint). For that, the abstract Query methods must be implemented and return a "default" weight for Filters which is the current ConstantScore Logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part could be directly used, the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries the idea is, to optimize the BooleanQuery logic in such a way, that it detects if a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API and not take the burden of the ConstantScoreQuery. There is only one special case: If the Filter is used alone in the BooleanQuery, then it must be executed as a ConstantScoreQuery, but only in this case. The problems with sorting are in my opinion not relevant: If score is identical (e.g. 1.0f) the results come in index order (this is how it appears to me). In this case TopDocs first list the docs with lower docIds. For the user this has the main advantage: That he can construct his query using a simplified API without thinking about Filters oder Queries, you can just combine clauses together. The scorer/weight logic then identifies the cases to use the filter or the query weight API. Just like the query optimizer of a RDB Is this clear now?
        Hide
        Paul Elschot added a comment -

        To add a Filter is as a clause to a BooleanQuery, I would prefer to not give it a Weight. Instead I'd like the addition of a required Filter to behave exactly like the current Searcher(Query, Filter) API.
        That also touches another point: backward compatibility with BooleanQuery and Searcher.

        It's certainly possible to add scoring behaviour to a Filter when it is added to a BooleanQuery. A default score value could be used, and also a default coordination behaviour.
        In principle it is also possible to add a disjunction of Filters to a BooleanQuery, even with a minimum number of required filters. For this case a score value does make sense.

        Required Filters and for prohibited Filters could be added to a BooleanQuery without scoring behaviour. In fact, for prohibited Queries, the score value is never used, so one might even constrain prohibited clauses to be Filters only.

        Most, if not all, of the scoring behaviour for Filters that was discussed so far can be obtained by using a ConstantScoreQuery based on a Filter and adding it to a BooleanQuery. So I think it would be cleaner to keep the scoring yes/no distinction between Queries and Filters. In case a simplified interface is desired this could then use any of the options available, for example always wrapping a Filter in a ConstantScoreQuery, and then composing a BooleanQuery only from Query clauses.

        Show
        Paul Elschot added a comment - To add a Filter is as a clause to a BooleanQuery, I would prefer to not give it a Weight. Instead I'd like the addition of a required Filter to behave exactly like the current Searcher(Query, Filter) API. That also touches another point: backward compatibility with BooleanQuery and Searcher. It's certainly possible to add scoring behaviour to a Filter when it is added to a BooleanQuery. A default score value could be used, and also a default coordination behaviour. In principle it is also possible to add a disjunction of Filters to a BooleanQuery, even with a minimum number of required filters. For this case a score value does make sense. Required Filters and for prohibited Filters could be added to a BooleanQuery without scoring behaviour. In fact, for prohibited Queries, the score value is never used, so one might even constrain prohibited clauses to be Filters only. Most, if not all, of the scoring behaviour for Filters that was discussed so far can be obtained by using a ConstantScoreQuery based on a Filter and adding it to a BooleanQuery. So I think it would be cleaner to keep the scoring yes/no distinction between Queries and Filters. In case a simplified interface is desired this could then use any of the options available, for example always wrapping a Filter in a ConstantScoreQuery, and then composing a BooleanQuery only from Query clauses.
        Hide
        Earwin Burrfoot added a comment -

        What about complete merge of filters/queries, and deciding whether to score/use constant score/don't score when adding a query to BooleanQuery (or AND/OR/NOT alternative)?
        Something along the lines of: boolQuery.add(new TermQuery(..), SHOULD, NO_SCORE)

        Show
        Earwin Burrfoot added a comment - What about complete merge of filters/queries, and deciding whether to score/use constant score/don't score when adding a query to BooleanQuery (or AND/OR/NOT alternative)? Something along the lines of: boolQuery.add(new TermQuery(..), SHOULD, NO_SCORE)
        Hide
        Paul Elschot added a comment - - edited

        This:

        boolQuery.add(new TermQuery(..), SHOULD, NO_SCORE)

        can be done (with the patch here applied) by:

        boolQuery.add(new QueryWrapperFilter(new TermQuery(..), MUST) .

        (SHOULD cannot be used for filters as clauses).

        I'll post a working version of the patch within a few days. It's better to discuss on working code than on ideas only.

        Show
        Paul Elschot added a comment - - edited This: boolQuery.add(new TermQuery(..), SHOULD, NO_SCORE) can be done (with the patch here applied) by: boolQuery.add(new QueryWrapperFilter(new TermQuery(..), MUST) . (SHOULD cannot be used for filters as clauses). I'll post a working version of the patch within a few days. It's better to discuss on working code than on ideas only.
        Hide
        Marvin Humphrey added a comment -

        > (SHOULD cannot be used for filters as clauses).

        It doesn't have to be that way. In KS, QueryFilter is a Query, which you can add as a clause to an ORQuery or a RequiredOptionalQuery. Docs which match only the QueryFilter are fed to the HitCollector with a score of 0.0.

        Show
        Marvin Humphrey added a comment - > (SHOULD cannot be used for filters as clauses). It doesn't have to be that way. In KS, QueryFilter is a Query, which you can add as a clause to an ORQuery or a RequiredOptionalQuery. Docs which match only the QueryFilter are fed to the HitCollector with a score of 0.0.
        Hide
        Marvin Humphrey added a comment -

        Uwe Schindler:

        > Maybe I should create an new JIRA issue out of my suggestion to merge
        > Filters and Queries? In my opinion, this is something nice to have in 3.0.

        I agree with this tack, having taken it in KS. However, I don't think we have
        consensus as far as the best approach yet, so perhaps it would be beneficial
        to hash things out on the mailing list first.

        Show
        Marvin Humphrey added a comment - Uwe Schindler: > Maybe I should create an new JIRA issue out of my suggestion to merge > Filters and Queries? In my opinion, this is something nice to have in 3.0. I agree with this tack, having taken it in KS. However, I don't think we have consensus as far as the best approach yet, so perhaps it would be beneficial to hash things out on the mailing list first.
        Hide
        Doug Cutting added a comment -

        Uwe> Maybe I should create an new JIRA issue out of my suggestion to merge Filters and Queries?

        +1 to creating a new issue and +1 to the idea.

        Show
        Doug Cutting added a comment - Uwe> Maybe I should create an new JIRA issue out of my suggestion to merge Filters and Queries? +1 to creating a new issue and +1 to the idea.
        Hide
        Uwe Schindler added a comment -

        I created and linked a new issue LUCENE-1518, that handles the merge suggestion. I also included all relevant comments from me about this.

        Show
        Uwe Schindler added a comment - I created and linked a new issue LUCENE-1518 , that handles the merge suggestion. I also included all relevant comments from me about this.
        Hide
        Paul Elschot added a comment -

        Ok, I'll wait for LUCENE-1518.

        Show
        Paul Elschot added a comment - Ok, I'll wait for LUCENE-1518 .
        Hide
        Paul Elschot added a comment -

        An interim update (22 March 2009) to the patch for this issue.

        The patch pushes the evaluation of required filter clauses into ConjunctionScorer, which should bring a small speed improvement to filtered searches. This also gives an alternative to the current use of filters in the Searcher API.

        However, the patch still fails some test cases, most notably TestBoolean2.testQueries06 which has two excluding query clauses. This indicates that there is a bug somewhere in/around the DisiDocQueue class that is added by the patch. Some of the random query tests also fail. AFAIK all failing test cases have excluding query clauses.
        So I think all tests will pass once this bug is solved.

        Another issue with this is more duplication of code originating from PriorityQueue in the DisiDocQueue class.

        I don't really like to post a patch with a known bug, but I can spend so little time on this issue in the near future that I prefer to give someone else the 'opportunity' to complete this.

        In case the patch does not apply cleanly please let me know.

        Show
        Paul Elschot added a comment - An interim update (22 March 2009) to the patch for this issue. The patch pushes the evaluation of required filter clauses into ConjunctionScorer, which should bring a small speed improvement to filtered searches. This also gives an alternative to the current use of filters in the Searcher API. However, the patch still fails some test cases, most notably TestBoolean2.testQueries06 which has two excluding query clauses. This indicates that there is a bug somewhere in/around the DisiDocQueue class that is added by the patch. Some of the random query tests also fail. AFAIK all failing test cases have excluding query clauses. So I think all tests will pass once this bug is solved. Another issue with this is more duplication of code originating from PriorityQueue in the DisiDocQueue class. I don't really like to post a patch with a known bug, but I can spend so little time on this issue in the near future that I prefer to give someone else the 'opportunity' to complete this. In case the patch does not apply cleanly please let me know.
        Hide
        Jason Rutherglen added a comment -

        Even though Paul's patch doesn't pass a test, it sounds like it
        can be benchmarked? Part of the goal of patch is to make certain
        queries faster? This could help with how we approach optimizing
        LUCENE-1345.

        Show
        Jason Rutherglen added a comment - Even though Paul's patch doesn't pass a test, it sounds like it can be benchmarked? Part of the goal of patch is to make certain queries faster? This could help with how we approach optimizing LUCENE-1345 .
        Hide
        Paul Elschot added a comment -

        The interesting thing to benchmark is filtered queries. One could do this by adding the filter as a required clause to a BooleanQuery in IndexSearcher, and see whether filtered queries are faster with that implementation. This part should work normally with the current patch.

        In case that turns out to make a real difference, it might also be considered to deprecate all the Searcher methods that take a Filter argument, and indicate the preferred alternative implementation with a Filter as a clause to BooleanQuery in the javadocs.

        Now, if I could find the time to get this last bug out of the current patch...

        Show
        Paul Elschot added a comment - The interesting thing to benchmark is filtered queries. One could do this by adding the filter as a required clause to a BooleanQuery in IndexSearcher, and see whether filtered queries are faster with that implementation. This part should work normally with the current patch. In case that turns out to make a real difference, it might also be considered to deprecate all the Searcher methods that take a Filter argument, and indicate the preferred alternative implementation with a Filter as a clause to BooleanQuery in the javadocs. Now, if I could find the time to get this last bug out of the current patch...
        Hide
        Paul Elschot added a comment -

        Since I don't think I'll get to providing an improved patch anytime soon, I'm setting the fix version to 3.1.

        The recent changes to DocIdSetIterator conflict with the patch here (not very much), and the patch here failed some test cases even before this conflict.

        Show
        Paul Elschot added a comment - Since I don't think I'll get to providing an improved patch anytime soon, I'm setting the fix version to 3.1. The recent changes to DocIdSetIterator conflict with the patch here (not very much), and the patch here failed some test cases even before this conflict.
        Hide
        Paul Elschot added a comment -

        Enough time has passed to show that there is not sufficient interestin this.

        Show
        Paul Elschot added a comment - Enough time has passed to show that there is not sufficient interestin this.
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Unassigned
            Reporter:
            Paul Elschot
          • Votes:
            4 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development