Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This issue presents a patch, that merges Queries and Filters in a way, that the new Filter class extends Query. This would make it possible, to use every filter as a query.

      The new abstract filter class would contain all methods of ConstantScoreQuery, deprecate ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods he has nothing more to do, he could just use the filter as a normal query.

      I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way, that every Filter can automatically be used at all places where a Query can be used (e.g. also alone a search query without any other constraint). For that, the abstract Query methods must be implemented and return a "default" weight for Filters which is the current ConstantScore Logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part could be directly used, the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries the idea is, to optimize the BooleanQuery logic in such a way, that it detects if a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API and not take the burden of the ConstantScoreQuery (see LUCENE-1345).

      Here some ideas how to implement Searcher.search() with Query and Filter:

      • User runs Searcher.search() using a Filter as the only parameter. As every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents.
      • User runs Searcher.search() using a Query as the only parameter: No change, all is the same as before
      • User runs Searcher.search() using a BooleanQuery as parameter: If the BooleanQuery does not contain a Query that is subclass of Filter (the new Filter) everything as usual. If the BooleanQuery only contains exactly one Filter and nothing else the Filter is used as a constant score query. If BooleanQuery contains clauses with Queries and Filters the new algorithm could be used: The queries are executed and the results filtered with the filters.

      For the user this has the main advantage: That he can construct his query using a simplified API without thinking about Filters oder Queries, you can just combine clauses together. The scorer/weight logic then identifies the cases to use the filter or the query weight API. Just like the query optimizer of a RDB.

      1. LUCENE-1518.patch
        9 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Yeah, thanks for touching the issue, so it gets back into my mind.

          Of course the patch non this issue is heavily outdated and the "simple" aproach of making Filter extend ConstantScoreQuery is no longer applicable as implemented there (because Filters can now be applied as random access bits down-low), so making them a query like proposed here would make them stop using RA bits. The better approach would be (as Eks Dev pointed out on LUCENE-4548) to mark the "Filter" as "non-scoring" and then scorers from e.g. BooleanWeight can optimize on that case (and possibly use bits instead of leap-frogging iterators).

          The idea in this patch to make the "Filter" class still available to users (as a convenience way to create a non-scoring query without having to implement a full query with weight and scorers) is what I like from this issue. But a Filter is just a "standard" query and can be used anywhere like e.g. in BooleanQuery. The main work to implement this would be in BooleanQuery to implement the features of FilteredQuery (which would go away together with QueryWrapperFilter and similar classes).

          Show
          Uwe Schindler added a comment - Yeah, thanks for touching the issue, so it gets back into my mind. Of course the patch non this issue is heavily outdated and the "simple" aproach of making Filter extend ConstantScoreQuery is no longer applicable as implemented there (because Filters can now be applied as random access bits down-low), so making them a query like proposed here would make them stop using RA bits. The better approach would be (as Eks Dev pointed out on LUCENE-4548 ) to mark the "Filter" as "non-scoring" and then scorers from e.g. BooleanWeight can optimize on that case (and possibly use bits instead of leap-frogging iterators). The idea in this patch to make the "Filter" class still available to users (as a convenience way to create a non-scoring query without having to implement a full query with weight and scorers) is what I like from this issue. But a Filter is just a "standard" query and can be used anywhere like e.g. in BooleanQuery. The main work to implement this would be in BooleanQuery to implement the features of FilteredQuery (which would go away together with QueryWrapperFilter and similar classes).
          Hide
          David Smiley added a comment -

          This issue may seem like ancient history but I'm nonetheless looking forward to it being realized. I think it will really simplify things. Do we re-target this at 5.0/trunk?

          Show
          David Smiley added a comment - This issue may seem like ancient history but I'm nonetheless looking forward to it being realized. I think it will really simplify things. Do we re-target this at 5.0/trunk?
          Hide
          Uwe Schindler added a comment -

          Push to 3.1! – Uwe

          Show
          Uwe Schindler added a comment - Push to 3.1! – Uwe
          Hide
          Mark Miller added a comment -

          This issue is marked as part of LUCENE-1345, which has been pushed to 3.1. Also, it has not yet found an assignee. Speak out, or I will push this to 3.1.

          Show
          Mark Miller added a comment - This issue is marked as part of LUCENE-1345 , which has been pushed to 3.1. Also, it has not yet found an assignee. Speak out, or I will push this to 3.1.
          Hide
          Eks Dev added a comment -

          Paul: ...The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values...

          Me: ...Imagine you have a Query and you are not interested in Scoring at all, this can be acomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe allready there?)...

          I knew Paul will kick in at this place, he sad exactly the same thing I did, but, as oposed to me, he made formulation that executes
          Pfff, I feel bad

          Show
          Eks Dev added a comment - Paul: ...The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values... Me: ...Imagine you have a Query and you are not interested in Scoring at all, this can be acomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe allready there?)... I knew Paul will kick in at this place, he sad exactly the same thing I did, but, as oposed to me, he made formulation that executes Pfff, I feel bad
          Hide
          Eks Dev added a comment -

          Shai,
          ----Regarding pure ranked, CSQ is really what we need, no? —

          Yep, it would work for Filters, but why not making it possible to have normal Query "constant score". For these cases, I am just not sure if this aproach gets max performance (did not look at this code for quite a while).

          Imagine you have a Query and you are not interested in Scoring at all, this can be acomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe allready there?)

          Paul,
          ---How about materializing the DocIds and the score values?---
          exactly, that would open full caching posibility (original purpose of Filters). Think Search Results caching ... that is practically another name for search() method. It is easy to create this, but using it again would require some bigger changes

          Filter_on_Steroids materialize(boolean without_score);

          Show
          Eks Dev added a comment - Shai, ----Regarding pure ranked, CSQ is really what we need, no? — Yep, it would work for Filters, but why not making it possible to have normal Query "constant score". For these cases, I am just not sure if this aproach gets max performance (did not look at this code for quite a while). Imagine you have a Query and you are not interested in Scoring at all, this can be acomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe allready there?) Paul, --- How about materializing the DocIds and the score values? --- exactly, that would open full caching posibility (original purpose of Filters). Think Search Results caching ... that is practically another name for search() method. It is easy to create this, but using it again would require some bigger changes Filter_on_Steroids materialize(boolean without_score);
          Hide
          Paul Elschot added a comment - - edited

          Create a FilterWeight which wraps a Filter and provide a Scorer implementation with a constant score. (This does not handle the "no scoring" mode, unless "no scoring" can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f).

          The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values.
          Using score=0.0f for no scoring might not work for BooleanQuery because it also has a coordination factor that depends on the number of matching query clauses. The patch at 1345 does not change that coordination factor for backward compatibility, even though the coordination factor might also depend on the number of a matching filter clauses.

          Show
          Paul Elschot added a comment - - edited Create a FilterWeight which wraps a Filter and provide a Scorer implementation with a constant score. (This does not handle the "no scoring" mode, unless "no scoring" can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Using score=0.0f for no scoring might not work for BooleanQuery because it also has a coordination factor that depends on the number of matching query clauses. The patch at 1345 does not change that coordination factor for backward compatibility, even though the coordination factor might also depend on the number of a matching filter clauses.
          Hide
          Paul Elschot added a comment - - edited

          opening option to "materialize Query" (that is how we today create Filters today)

          How about materializing the DocIds and the score values?

          Show
          Paul Elschot added a comment - - edited opening option to "materialize Query" (that is how we today create Filters today) How about materializing the DocIds and the score values?
          Hide
          Michael McCandless added a comment -

          Let's not forget we also have "provides scores but NOT filtering" type things as well, eg function queries, MatchAllDocsQuery, "I want to boost documents by recency" use case (which sort of a Scorer filter in that it takes another Scorer and modifies its output, per doc), etc.

          It's just that very often the "scoring part" is in fact very much intertwined with the "filtering" part. EG a TermQuery iterates a SegmentTermDocs, and reads & holds freq/doc in pairs.

          Show
          Michael McCandless added a comment - Let's not forget we also have "provides scores but NOT filtering" type things as well, eg function queries, MatchAllDocsQuery, "I want to boost documents by recency" use case (which sort of a Scorer filter in that it takes another Scorer and modifies its output, per doc), etc. It's just that very often the "scoring part" is in fact very much intertwined with the "filtering" part. EG a TermQuery iterates a SegmentTermDocs, and reads & holds freq/doc in pairs.
          Hide
          Shai Erera added a comment -

          Wrapping with CSQ is just adding anothe layer between Lucene search machinery and Filter, making these optimizations harder.

          Right. But making Filter sub-class Query and check in BQ 'if (query instanceof Filter) { Filter f = (Filter) query)' is not going to improve anything. It adds instanceof and casting, and I'd think those are more expensive than wrapping a Filter with CSQ and returning an appropriate Scorer, which will use the Filter in its next() and skipTo() calls.

          On the other hand, I must accept, conceptually FIter and Query are "the same", supporting together following options

          I think that if we allow BooleanClause to implement a Weight(IndexReader) (just like Query) we'll be one more step closer to that goal? BQ uses this method to construct BooleanWeight, only today it calls clause.getQuery().createWeight(). Instead it could do clause.getWeight, and if the BooleanClause holds a Filter it will return a FilterWeight, otherwise delegate that call to the contained Query.

          Regarding pure ranked, CSQ is really what we need, no?

          So how about the following:

          1. Add add(Filter, Occur) to BooleanClause.
          2. Add weight(Searcher) to BooleanClause.
          3. Create a FilterWeight which wraps a Filter and provide a Scorer implementation with a constant score. (This does not handle the "no scoring" mode, unless "no scoring" can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f).
          4. Add isRandomAccess to Filter.
          5. Create a RandomAccessFilter which extends Filter and defines an additional seek(target) method.
          6. Add asRandomAccessFilter() to Filter, which will materialize that Filter into memory, or into another RandomAccess data structure (e.g. keeping it on disk but still provide random access to it, even if not very efficient) and return a RandomAccessFilter type, which will implement seek(target) and possibly override next() and skipTo(), but still use whatever other methods this Filter declares.
            • I think we should default it to throw UOE providing that we document that isRandomAccess should first be called.

          I'm thinking out loud just like you, so I hope my stuff makes sense .

          Show
          Shai Erera added a comment - Wrapping with CSQ is just adding anothe layer between Lucene search machinery and Filter, making these optimizations harder. Right. But making Filter sub-class Query and check in BQ 'if (query instanceof Filter) { Filter f = (Filter) query)' is not going to improve anything. It adds instanceof and casting, and I'd think those are more expensive than wrapping a Filter with CSQ and returning an appropriate Scorer, which will use the Filter in its next() and skipTo() calls. On the other hand, I must accept, conceptually FIter and Query are "the same", supporting together following options I think that if we allow BooleanClause to implement a Weight(IndexReader) (just like Query) we'll be one more step closer to that goal? BQ uses this method to construct BooleanWeight, only today it calls clause.getQuery().createWeight(). Instead it could do clause.getWeight, and if the BooleanClause holds a Filter it will return a FilterWeight, otherwise delegate that call to the contained Query. Regarding pure ranked, CSQ is really what we need, no? So how about the following: Add add(Filter, Occur) to BooleanClause. Add weight(Searcher) to BooleanClause. Create a FilterWeight which wraps a Filter and provide a Scorer implementation with a constant score. (This does not handle the "no scoring" mode, unless "no scoring" can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). Add isRandomAccess to Filter. Create a RandomAccessFilter which extends Filter and defines an additional seek(target) method. Add asRandomAccessFilter() to Filter, which will materialize that Filter into memory, or into another RandomAccess data structure (e.g. keeping it on disk but still provide random access to it, even if not very efficient) and return a RandomAccessFilter type, which will implement seek(target) and possibly override next() and skipTo(), but still use whatever other methods this Filter declares. I think we should default it to throw UOE providing that we document that isRandomAccess should first be called. I'm thinking out loud just like you, so I hope my stuff makes sense .
          Hide
          Eks Dev added a comment -

          imo, it is really not all that important to make Filter and Query the same (that is just one alternative to achieve goal).

          Basic problem we try to solve is adding Filter directly to BoolenQuery, and making optimizations after that easier. Wrapping with CSQ is just adding anothe layer between Lucene search machinery and Filter, making these optimizations harder.

          On the other hand, I must accept, conceptually FIter and Query are "the same", supporting together following options:
          1. Pure boolean model: You do not care about scores (today we can do it only wia CSQ, as Filter does not enter BoolenQuery)
          2. Mixed boolean and ranked: you have to define Filter contribution to the documents (CSQ)
          3. Pure ranked: No filters, all gets scored (the same as 2.)

          Ideally, as a user, I define only Query (Filter based or not) and for each clause in my Query define
          Query.setScored(true/false) or useConstantScore(double score);

          also I should be able to say, "Dear Lucene please materialize this "Query_Filter" for me as I would like to have it cached and please store only DocIds (Filter today). Maybe open possibility to open possibility to cache scores of the documents as well.

          one thing is concept and another is optimization. From optimization point of view, we have couple of decisions to make:

          • DocID Set supports random access, yes or no (my "Materialized Query")
          • Decide if clause should / should not be scored/ or should be constant

          So, for each "Query" we need to decide/support:

          • scoring {yes, no, constant}

            and

          • opening option to "materialize Query" (that is how we today create Filters today)
          • these Materialized Queries (aka Filter) should be able to tell us if they support random access, if they cache only doc id's or scores as well

          nothing usefull in this email, just thinking aloud, sometimes helps

          Show
          Eks Dev added a comment - imo, it is really not all that important to make Filter and Query the same (that is just one alternative to achieve goal). Basic problem we try to solve is adding Filter directly to BoolenQuery, and making optimizations after that easier. Wrapping with CSQ is just adding anothe layer between Lucene search machinery and Filter, making these optimizations harder. On the other hand, I must accept, conceptually FIter and Query are "the same", supporting together following options: 1. Pure boolean model: You do not care about scores (today we can do it only wia CSQ, as Filter does not enter BoolenQuery) 2. Mixed boolean and ranked: you have to define Filter contribution to the documents (CSQ) 3. Pure ranked: No filters, all gets scored (the same as 2.) Ideally, as a user, I define only Query (Filter based or not) and for each clause in my Query define Query.setScored(true/false) or useConstantScore(double score); also I should be able to say, "Dear Lucene please materialize this "Query_Filter" for me as I would like to have it cached and please store only DocIds (Filter today). Maybe open possibility to open possibility to cache scores of the documents as well. one thing is concept and another is optimization. From optimization point of view, we have couple of decisions to make: DocID Set supports random access, yes or no (my "Materialized Query") Decide if clause should / should not be scored/ or should be constant So, for each "Query" we need to decide/support: scoring {yes, no, constant} and opening option to "materialize Query" (that is how we today create Filters today) these Materialized Queries (aka Filter) should be able to tell us if they support random access, if they cache only doc id's or scores as well nothing usefull in this email, just thinking aloud, sometimes helps
          Hide
          Shai Erera added a comment -

          I would like to query why do we need to make Filter and Query of the same type? After all, they both do different things, even though it looks like they are similar. Attempting to do this yields those peculiarities:

          1. If Filter extends Query, it now has to implement all sorts of methods like weight, toString, rewrite, getTerms and scoresDocInOrder (an addition from LUCENE-1593).
          2. If Query extends Filter, it has to implement getDocIdSet.
          3. Introduce instanceof checks in places just to check if a given Query is actually a Filter or not.

          Both (1) and (2) are completely redundant for both Query and Filter, i.e. why should Filter implement toString(term) or scoresDocInOrder when it does score docs? Why should Query implement getDocIdSet when it already implements a weight().scorer() which returns a DocIdSetIterator?

          I read the different posts on this issue and I don't understand why we think that the API is not clear enough today, or is not convenient:

          • If I want to just filter the entire index, I have two ways: (1) execute a search with MatchAllDocsQuery and a Filter (2) Wrap a filter with ConstantScoreQuery. I don't see the difference between the two, and I don't think it forces any major/difficult decision on the user.
          • If I want to have a BooleanQuery with several clauses and I want a clause to be a complex one with a Filter, I can wrap the Filter with CSQ.
          • If I want to filter a Query, there is already API today on Searcher which accepts both Query and Filter.

          At least as I understand it, Queries are supposed to score documents, while Filters to just filter. If there is an API which requires Queries only, then I can wrap my Filter with CSQ, but I'd prefer to check if we can change that API first (for example, allowing BooleanClause to accept a Filter, and implement a weight(IndexReader) rather than just getQuery()).

          So if Filters just filter and Queries just score, the API on both is very clear: Filter returns a DISI and Query returns a Scorer (which is also a DISI). I don't see the advantage of having the code unaware to the fact a certain Query is actually a Fitler - I prefer it to be upfront. That way, we can do all sorts of optimizations, like asking the Filter for next() first, if we know it's supposed to filter most of the documents.

          At the end of the day, both Filter and Query iterate on documents. The difference lies in the purpose of iteration. In my code there are several Query implementations today that just filter documents, and I plan to change all of them to implement Filter instead (that was originally the case because Filter had just bits() and now it's more efficient with the iterator() version, at least to me). I want to do this for a couple of reasons, clarity being one of the most important. If Filter just filters, I don't see why it should inherit all the methods from Query (or vice versa BTW), especially when I have this CSQ wrapper.
          To me, as a Lucene user, I make far more complicated decisions every day than deciding whether I want to use a Filter as a Query or not. If I pass it directly to IndexSearcher, I use it as a filter. If I use a different API which accepts just Query, I wrap it with CSQ. As simple as that.

          But that's just my two cents.

          Show
          Shai Erera added a comment - I would like to query why do we need to make Filter and Query of the same type? After all, they both do different things, even though it looks like they are similar. Attempting to do this yields those peculiarities: If Filter extends Query, it now has to implement all sorts of methods like weight, toString, rewrite, getTerms and scoresDocInOrder (an addition from LUCENE-1593 ). If Query extends Filter, it has to implement getDocIdSet. Introduce instanceof checks in places just to check if a given Query is actually a Filter or not. Both (1) and (2) are completely redundant for both Query and Filter, i.e. why should Filter implement toString(term) or scoresDocInOrder when it does score docs? Why should Query implement getDocIdSet when it already implements a weight().scorer() which returns a DocIdSetIterator? I read the different posts on this issue and I don't understand why we think that the API is not clear enough today, or is not convenient: If I want to just filter the entire index, I have two ways: (1) execute a search with MatchAllDocsQuery and a Filter (2) Wrap a filter with ConstantScoreQuery. I don't see the difference between the two, and I don't think it forces any major/difficult decision on the user. If I want to have a BooleanQuery with several clauses and I want a clause to be a complex one with a Filter, I can wrap the Filter with CSQ. If I want to filter a Query, there is already API today on Searcher which accepts both Query and Filter. At least as I understand it, Queries are supposed to score documents, while Filters to just filter. If there is an API which requires Queries only, then I can wrap my Filter with CSQ, but I'd prefer to check if we can change that API first (for example, allowing BooleanClause to accept a Filter, and implement a weight(IndexReader) rather than just getQuery()). So if Filters just filter and Queries just score, the API on both is very clear: Filter returns a DISI and Query returns a Scorer (which is also a DISI). I don't see the advantage of having the code unaware to the fact a certain Query is actually a Fitler - I prefer it to be upfront. That way, we can do all sorts of optimizations, like asking the Filter for next() first, if we know it's supposed to filter most of the documents. At the end of the day, both Filter and Query iterate on documents. The difference lies in the purpose of iteration. In my code there are several Query implementations today that just filter documents, and I plan to change all of them to implement Filter instead (that was originally the case because Filter had just bits() and now it's more efficient with the iterator() version, at least to me). I want to do this for a couple of reasons, clarity being one of the most important. If Filter just filters, I don't see why it should inherit all the methods from Query (or vice versa BTW), especially when I have this CSQ wrapper. To me, as a Lucene user, I make far more complicated decisions every day than deciding whether I want to use a Filter as a Query or not. If I pass it directly to IndexSearcher, I use it as a filter. If I use a different API which accepts just Query, I wrap it with CSQ. As simple as that. But that's just my two cents.
          Hide
          Michael McCandless added a comment -

          I really hate the hasRandomAccess() approach from an OO design standpoint, but I have to admit that I don't have anything better.

          I think we need to approach this as a structural optimization problem;
          I think there should be an "optimize()" step after (or during)
          rewrite(). The optimize phase would structure the matching to run as
          quickly as possible.

          Ie, on detecting somehow that a filter is random access, we should
          multiply it out (down) to each TermScorer. If deletes are
          pre-multiplied in the filter, we tell each TermScorer not to check
          deletes.

          [We may need a filter manager to go along w/ this, eg that will
          convert a filter to random-access (if it's going to be reused),
          multiply in deletes (and re-do that whenever new reader is opened),
          etc.]

          Likewise, LUCENE-1252 splits matching of queries that consult
          positional information into two steps (roughly "cheap" and
          "expensive") and does all "cheap" tests across each "and" clause and
          "expensive" only when necessary. So optimize() would return two
          matchers for such queries, and we'd "collate" the cheap matchers
          together first, followed by the expensive one.

          Not requiring an implicit next() after skipTo() ... so optimize would
          decide which matchers should "drive" the iteration, and which others
          should do random-access test. Some next()'s (eg OR or AND matchers)
          are far more costly than other next()'s (eg, TermScorer). Some are
          far more restrictive than others, etc.

          Of course, some filters require iterator access, so clearly we must
          accept that.

          At some point, there will be too much splintering of options and
          source code specialization should [somehow] take over in enumerating
          all the combinations. EG the field-sort collectors are already
          getting close to this (record score or not, compute max score or not,
          single field vs multi field, docID required for tie breaking or not,
          etc).

          Show
          Michael McCandless added a comment - I really hate the hasRandomAccess() approach from an OO design standpoint, but I have to admit that I don't have anything better. I think we need to approach this as a structural optimization problem; I think there should be an "optimize()" step after (or during) rewrite(). The optimize phase would structure the matching to run as quickly as possible. Ie, on detecting somehow that a filter is random access, we should multiply it out (down) to each TermScorer. If deletes are pre-multiplied in the filter, we tell each TermScorer not to check deletes. [We may need a filter manager to go along w/ this, eg that will convert a filter to random-access (if it's going to be reused), multiply in deletes (and re-do that whenever new reader is opened), etc.] Likewise, LUCENE-1252 splits matching of queries that consult positional information into two steps (roughly "cheap" and "expensive") and does all "cheap" tests across each "and" clause and "expensive" only when necessary. So optimize() would return two matchers for such queries, and we'd "collate" the cheap matchers together first, followed by the expensive one. Not requiring an implicit next() after skipTo() ... so optimize would decide which matchers should "drive" the iteration, and which others should do random-access test. Some next()'s (eg OR or AND matchers) are far more costly than other next()'s (eg, TermScorer). Some are far more restrictive than others, etc. Of course, some filters require iterator access, so clearly we must accept that. At some point, there will be too much splintering of options and source code specialization should [somehow] take over in enumerating all the combinations. EG the field-sort collectors are already getting close to this (record score or not, compute max score or not, single field vs multi field, docID required for tie breaking or not, etc).
          Hide
          Marvin Humphrey added a comment -

          > Have we agreed on how the class structure is going to look and what the
          > optimizations will look like?

          I don't think this problem is solved.

          I thought I had come up with a grand unifying OO design solution with my
          Matcher proposal for Lucy. Matcher was to be the base class for any matching
          iterator, whether the task was scoring, filtering, or deletions (a specialized
          kind of filter). A provisional implementation has been completed in KS svn
          trunk.

          Unfortunately, McCandless's benchmarks on iterated vs. random access filters
          have blown a hole in the Matcher approach. Whether it can be mended, or
          whether starting over from scratch is the best idea, I don't know. I really
          hate the hasRandomAccess() approach from an OO design standpoint, but I have
          to admit that I don't have anything better. However, I do think that it's
          essential for any Filter refactoring to answer Mike's challenge in full.

          Right now, we don't have search-time benchmarking capabilities for Lucy/KS,
          and I don't think it's possible for me, at least, to pursue any further
          experimentation until we acquire those capabilities. For the time being, I've
          turned my attention to other concerns, and I don't expect to push this issue
          forward for a little while.

          Show
          Marvin Humphrey added a comment - > Have we agreed on how the class structure is going to look and what the > optimizations will look like? I don't think this problem is solved. I thought I had come up with a grand unifying OO design solution with my Matcher proposal for Lucy. Matcher was to be the base class for any matching iterator, whether the task was scoring, filtering, or deletions (a specialized kind of filter). A provisional implementation has been completed in KS svn trunk. Unfortunately, McCandless's benchmarks on iterated vs. random access filters have blown a hole in the Matcher approach. Whether it can be mended, or whether starting over from scratch is the best idea, I don't know. I really hate the hasRandomAccess() approach from an OO design standpoint, but I have to admit that I don't have anything better. However, I do think that it's essential for any Filter refactoring to answer Mike's challenge in full. Right now, we don't have search-time benchmarking capabilities for Lucy/KS, and I don't think it's possible for me, at least, to pursue any further experimentation until we acquire those capabilities. For the time being, I've turned my attention to other concerns, and I don't expect to push this issue forward for a little while.
          Hide
          Uwe Schindler added a comment - - edited

          My opinion is that the attached patch has most backwards-compatibility (with some small toString() issues), but it makes Filter a subclass of query with the default constant score logic. Further work to remove the extra Filter classes for MultiTermQueries (RangeFilter, PrefixFilter,...) and so on can be done later or as part of this patch (did'nt want to start with this before being sure, that the API looks good).

          This will not make anything faster, but make the API for users simplier. No longer differentiate between query and filter.

          The optimization can be part of LUCENE-1345. In this case, BooleanQuery API does not need to be changed (no extra Filter clauses, because Filters can be added directly as normal Query clauses). The optimizations from 1345 then could directltly test with instanceof, if something is a filter.

          WAS: One problem is Fuzzy Query that can also be used as Filter (subclass of MultiTermQuery), but there scoring is important.

          EDIT:
          This is not a problem: MultiTermQuery will also be a filter and vice versa. It depends on what rewrite does. If ConstantScoreMode is on, it will return itsself (through super.rewrite()), if it rewrites to boolean query, it should do it. In this case the result of rewrite is no longer a filter, and filter optimization do not apply. So important with this new API is, that rewrite is called first (even for filters, but as filters are queries now, this would automatically be done). The optimizations in 1345 then would only get the result of the rewrite (and instanceof checks would be on the rewritten one, which maybe a filter or not).

          Show
          Uwe Schindler added a comment - - edited My opinion is that the attached patch has most backwards-compatibility (with some small toString() issues), but it makes Filter a subclass of query with the default constant score logic. Further work to remove the extra Filter classes for MultiTermQueries (RangeFilter, PrefixFilter,...) and so on can be done later or as part of this patch (did'nt want to start with this before being sure, that the API looks good). This will not make anything faster, but make the API for users simplier. No longer differentiate between query and filter. The optimization can be part of LUCENE-1345 . In this case, BooleanQuery API does not need to be changed (no extra Filter clauses, because Filters can be added directly as normal Query clauses). The optimizations from 1345 then could directltly test with instanceof, if something is a filter. WAS: One problem is Fuzzy Query that can also be used as Filter (subclass of MultiTermQuery), but there scoring is important. EDIT: This is not a problem: MultiTermQuery will also be a filter and vice versa. It depends on what rewrite does. If ConstantScoreMode is on, it will return itsself (through super.rewrite()), if it rewrites to boolean query, it should do it. In this case the result of rewrite is no longer a filter, and filter optimization do not apply. So important with this new API is, that rewrite is called first (even for filters, but as filters are queries now, this would automatically be done). The optimizations in 1345 then would only get the result of the rewrite (and instanceof checks would be on the rewritten one, which maybe a filter or not).
          Hide
          Jason Rutherglen added a comment -

          What's the status of this patch? Have we agreed on how the class
          structure is going to look and what the optimizations will look
          like?

          Show
          Jason Rutherglen added a comment - What's the status of this patch? Have we agreed on how the class structure is going to look and what the optimizations will look like?
          Hide
          Uwe Schindler added a comment - - edited

          This patch does not want to completely merge queries and filters. It only gives the possibility to use a Filter directly as a query. So ValueSourceQuery will stay direct subclass of Query and not of Filter and can be used as scorer only. I think completely merging both is not a good idea.

          But for easy usage, the idea of making filter automatically work as query (using a constant score algorithm) may be good. So queries (e.g. without matching, also conventional queries) can stay alive, but filters can be used as query (and BooleanQuery could automatically have filters as clauses, and is able to optimize and factor away scoring).

          The user only has an easy understandable API and would not be affected with thinking about "is it better to use a ConstantScoreQuery or should I just filter the results of the other boolean query clauses? What if I only have the filter, do I need a MatchAllDocsQuery and filter it, or better use ConstantScore in this case?".

          Show
          Uwe Schindler added a comment - - edited This patch does not want to completely merge queries and filters. It only gives the possibility to use a Filter directly as a query. So ValueSourceQuery will stay direct subclass of Query and not of Filter and can be used as scorer only. I think completely merging both is not a good idea. But for easy usage, the idea of making filter automatically work as query (using a constant score algorithm) may be good. So queries (e.g. without matching, also conventional queries) can stay alive, but filters can be used as query (and BooleanQuery could automatically have filters as clauses, and is able to optimize and factor away scoring). The user only has an easy understandable API and would not be affected with thinking about "is it better to use a ConstantScoreQuery or should I just filter the results of the other boolean query clauses? What if I only have the filter, do I need a MatchAllDocsQuery and filter it, or better use ConstantScore in this case?".
          Hide
          Michael McCandless added a comment -

          At first blush, this patch seems backwards: shouldn't the base class
          be a Filter (which does "pure matching" with no scoring) and then
          subclass of that (Query) would add in scoring?

          But, then I think I appreciate why we did this: it's because we accept
          Query all over the place now, so it's a far less disruptive change to
          have Filter be the subclass.

          Or... we might want to allow scoring without matching, so maybe they
          should be fully independent?

          EG when I want to dynamically boost relevance scores (eg by recency,
          popularity, personalizaztion, etc.), I would want to make a "pure
          scorer" that has absolutely no matching role, whose score gets "mixed
          in" based on the sub-query's boost.

          ValueSourceQuery is an example: it has a "degenerate" matcher (matches
          all docs) whose only purpose is to produce custom per-doc scoring.

          Shouldn't I be able to add a "purer scorer" (no matching) as a clause
          onto BooleanQuery?

          Show
          Michael McCandless added a comment - At first blush, this patch seems backwards: shouldn't the base class be a Filter (which does "pure matching" with no scoring) and then subclass of that (Query) would add in scoring? But, then I think I appreciate why we did this: it's because we accept Query all over the place now, so it's a far less disruptive change to have Filter be the subclass. Or... we might want to allow scoring without matching, so maybe they should be fully independent? EG when I want to dynamically boost relevance scores (eg by recency, popularity, personalizaztion, etc.), I would want to make a "pure scorer" that has absolutely no matching role, whose score gets "mixed in" based on the sub-query's boost. ValueSourceQuery is an example: it has a "degenerate" matcher (matches all docs) whose only purpose is to produce custom per-doc scoring. Shouldn't I be able to add a "purer scorer" (no matching) as a clause onto BooleanQuery?
          Hide
          Paul Elschot added a comment - - edited

          I don't think the time is lost. In case there is a way to really simplify/unify Query and Filter, I want to know, too.

          Btw. there is no instanceof check in the patch at LUCENE-1345. The reason for that is that adding a Filter as a clause is done by another method than the method to add a Query, so the compiler can deal with the difference.

          Show
          Paul Elschot added a comment - - edited I don't think the time is lost. In case there is a way to really simplify/unify Query and Filter, I want to know, too. Btw. there is no instanceof check in the patch at LUCENE-1345 . The reason for that is that adding a Filter as a clause is done by another method than the method to add a Query, so the compiler can deal with the difference.
          Hide
          Uwe Schindler added a comment -

          I have no problem with this. I was waiting for more comments on this befor starting to factor out ConstantScoreQuery from the rest of the core/contrib lasses. In principle the current patch is backwards compatible, so you could also use it as basis for your work. In the case that my idea is not the way to go, this may be lost time (but: as always queries are rewritten before execution, the unneeded ConstantScoreQuery rewrites to the filter, and your instanceof check in BooleanScorer/-Query should correctly detect all filters).

          Show
          Uwe Schindler added a comment - I have no problem with this. I was waiting for more comments on this befor starting to factor out ConstantScoreQuery from the rest of the core/contrib lasses. In principle the current patch is backwards compatible, so you could also use it as basis for your work. In the case that my idea is not the way to go, this may be lost time (but: as always queries are rewritten before execution, the unneeded ConstantScoreQuery rewrites to the filter, and your instanceof check in BooleanScorer/-Query should correctly detect all filters).
          Hide
          Paul Elschot added a comment -

          Even now that this has had some time, I still think LUCENE-1345 should go before this, and implement Filter clauses to BooleanQuery using the current simple Filter in a backward compatible way.

          Then, in case the resulting BooleanQuery really turns out to be too complicated for general use, this issue could be taken up again.

          Show
          Paul Elschot added a comment - Even now that this has had some time, I still think LUCENE-1345 should go before this, and implement Filter clauses to BooleanQuery using the current simple Filter in a backward compatible way. Then, in case the resulting BooleanQuery really turns out to be too complicated for general use, this issue could be taken up again.
          Hide
          Paul Elschot added a comment -

          The option is needed because the coordination factor in the result score depends on the number of matching query clauses.

          Also, without the option, a check will have to be done whether the score is constant and 0. I'm not even sure whether that would always work for subclasses of Filter that change the score to a possibly non constant value.

          Show
          Paul Elschot added a comment - The option is needed because the coordination factor in the result score depends on the number of matching query clauses. Also, without the option, a check will have to be done whether the score is constant and 0. I'm not even sure whether that would always work for subclasses of Filter that change the score to a possibly non constant value.
          Hide
          Uwe Schindler added a comment -

          That means that to add a Filter to a BooleanQuery as a clause, the user will need an option whether or not to have the Filter take part in the scoring.

          Why is this needed? A constant score query does not effect the scoring at all. Why not simply optimize the scoreing of filters away in BooleanQuery in all cases (if there are other queries available in the BooleanQuery that do scoring)?

          Show
          Uwe Schindler added a comment - That means that to add a Filter to a BooleanQuery as a clause, the user will need an option whether or not to have the Filter take part in the scoring. Why is this needed? A constant score query does not effect the scoring at all. Why not simply optimize the scoreing of filters away in BooleanQuery in all cases (if there are other queries available in the BooleanQuery that do scoring)?
          Hide
          Paul Elschot added a comment -

          Broadly, the patch moves the internal ConstantScorer class from ConstantScoreQuery to Filter.

          That means that to add a Filter to a BooleanQuery as a clause, the user will need an option whether or not to have the Filter take part in the scoring.
          So, with this for BooleanQuery an option of scoring/non scoring in the API for adding clauses will be needed,
          whereas without this the possibility to add a Filter will be needed. I don't care either way.

          Are there other places where a Query is used now in which a Filter could also be useful?
          And are there other use cases for ConstantScoreQuery than on top of a Filter?

          Show
          Paul Elschot added a comment - Broadly, the patch moves the internal ConstantScorer class from ConstantScoreQuery to Filter. That means that to add a Filter to a BooleanQuery as a clause, the user will need an option whether or not to have the Filter take part in the scoring. So, with this for BooleanQuery an option of scoring/non scoring in the API for adding clauses will be needed, whereas without this the possibility to add a Filter will be needed. I don't care either way. Are there other places where a Query is used now in which a Filter could also be useful? And are there other use cases for ConstantScoreQuery than on top of a Filter?
          Hide
          Uwe Schindler added a comment - - edited

          In my opinion, both approches could be combined. I do not know how the scoring and the whole BooleanQery works, I did only the merging of the API. If we have consensus, that this may be good, I could remove the rest of deprecated ConstantScoreQuery. The new code then works without any problems and backwards compatible as before, but with no optimization.

          Then Paul could adapt his patch to not create new methods to add Filter clauses to BooleanQueries, but do just some difference in the whole BooleanQuery logic like that:

          if (clause.query instanceof Filter)

          { do only filter optimization from Paul's patch }

          else

          { do conventional boolean scoring }

          If he cannot do the optimization (because the Filter is alone in BooleanQuery) he could just fall back to the standard query logic (that uses implicit the current ConstantScoreQuery algorithm).

          But before start implementing more for removing the deprecated class, I wanted to hear some more ideas, maybe something completely different (like every query can also automatically be a filter): Only one superclass "Query" no filter anymore, every query clause can implement a filter and/or a query with always a Fallback to the other side (if no filter implementation provided, a filter is provided by a algorithm like QueryFilter; if no weight/rewrite, constant score weight is provided).

          Just ideas...

          Show
          Uwe Schindler added a comment - - edited In my opinion, both approches could be combined. I do not know how the scoring and the whole BooleanQery works, I did only the merging of the API. If we have consensus, that this may be good, I could remove the rest of deprecated ConstantScoreQuery. The new code then works without any problems and backwards compatible as before, but with no optimization. Then Paul could adapt his patch to not create new methods to add Filter clauses to BooleanQueries, but do just some difference in the whole BooleanQuery logic like that: if (clause.query instanceof Filter) { do only filter optimization from Paul's patch } else { do conventional boolean scoring } If he cannot do the optimization (because the Filter is alone in BooleanQuery) he could just fall back to the standard query logic (that uses implicit the current ConstantScoreQuery algorithm). But before start implementing more for removing the deprecated class, I wanted to hear some more ideas, maybe something completely different (like every query can also automatically be a filter): Only one superclass "Query" no filter anymore, every query clause can implement a filter and/or a query with always a Fallback to the other side (if no filter implementation provided, a filter is provided by a algorithm like QueryFilter; if no weight/rewrite, constant score weight is provided). Just ideas...
          Hide
          Jason Rutherglen added a comment -

          Hopefully this will allow caching of individual term queries even if they are a part of a boolean query?

          Show
          Jason Rutherglen added a comment - Hopefully this will allow caching of individual term queries even if they are a part of a boolean query?
          Hide
          Paul Elschot added a comment -

          There used to be places that filtered anything with a 0.0 score however. Are any of those left?

          They should be gone by now, even from the javadocs.

          Show
          Paul Elschot added a comment - There used to be places that filtered anything with a 0.0 score however. Are any of those left? They should be gone by now, even from the javadocs.
          Hide
          Eks Dev added a comment -

          nice,
          you did it top down (api), Paul takes it bottom up (speed).

          this makes some really crazy things possible, e.g. implementing normal TermQuery as a "DirectFilter" and when the optimization of the BooleanQuery gets done (no Score calculation, direct usage of DocIdSetIterators) you can speed up some queries containing TermQuery without really instantiating Filter. Of course only for cases where tf/idf/norm can be ignored.

          Kind of middle-ground between Filter and full ranked TermQuery (better said any BooleanQuery!), Faster than ranked case due to the switched off score calculation and more comfortable than Filter usage, no instantiation of DocIdSet-s...

          very nice indeed, smooth mix between ranked and "pure boolean" model with both benefits.

          Show
          Eks Dev added a comment - nice, you did it top down (api), Paul takes it bottom up (speed). this makes some really crazy things possible, e.g. implementing normal TermQuery as a "DirectFilter" and when the optimization of the BooleanQuery gets done (no Score calculation, direct usage of DocIdSetIterators) you can speed up some queries containing TermQuery without really instantiating Filter. Of course only for cases where tf/idf/norm can be ignored. Kind of middle-ground between Filter and full ranked TermQuery (better said any BooleanQuery!), Faster than ranked case due to the switched off score calculation and more comfortable than Filter usage, no instantiation of DocIdSet-s... very nice indeed, smooth mix between ranked and "pure boolean" model with both benefits.
          Hide
          Uwe Schindler added a comment -

          Why 1.0 and not 0.0? I chose to use 0.0 in KS because then a Filter would effectively perform only binary filtering and never affect scores.

          You are right. I was coming from the ConstantScoreQuery that has a score of 1.0 in the result.

          Perhaps your envisioned query optimization algo ensures that the Filter will only serve as a DocIDSetIterator unless it's used as a top-level Query? Can we agree that a Filter used as a sub-clause in a complex query should not contribute to the aggregate score?

          UPDATE: A closer reading reveals that the original issue text contains the answer to this question (we agree), so the only remaining question is whether a top level query would return hits with scores of 0.0 or 1.0. Which is probably a bikeshed painting issue.

          This is exactly what I was thinking about. But it is not only a top level query. In the case of a BooleanQuery containing only one clause that is a filter, the constant score implementation must also used (as a Filter alone is useless).

          ConstantScoreQuery returns 1.0 if executed alone or alone in a BooleanQuery.

          For backwards compatibility, I think we should just use the following logic:
          A filter is a query, but can also be used as a query (if alone). The default implementation for this is a constant score query as currently when wrapping a filter with ConstantScoreQuery. In all other cases (combined with other boolean clauses and not alone), the score calculation is removed and you can say 0.0f (if additive).

          Show
          Uwe Schindler added a comment - Why 1.0 and not 0.0? I chose to use 0.0 in KS because then a Filter would effectively perform only binary filtering and never affect scores. You are right. I was coming from the ConstantScoreQuery that has a score of 1.0 in the result. Perhaps your envisioned query optimization algo ensures that the Filter will only serve as a DocIDSetIterator unless it's used as a top-level Query? Can we agree that a Filter used as a sub-clause in a complex query should not contribute to the aggregate score? UPDATE: A closer reading reveals that the original issue text contains the answer to this question (we agree), so the only remaining question is whether a top level query would return hits with scores of 0.0 or 1.0. Which is probably a bikeshed painting issue. This is exactly what I was thinking about. But it is not only a top level query. In the case of a BooleanQuery containing only one clause that is a filter, the constant score implementation must also used (as a Filter alone is useless). ConstantScoreQuery returns 1.0 if executed alone or alone in a BooleanQuery. For backwards compatibility, I think we should just use the following logic: A filter is a query, but can also be used as a query (if alone). The default implementation for this is a constant score query as currently when wrapping a filter with ConstantScoreQuery. In all other cases (combined with other boolean clauses and not alone), the score calculation is removed and you can say 0.0f (if additive).
          Hide
          Doug Cutting added a comment -

          > Why 1.0 and not 0.0?

          0.0 does seem more appropriate, since scores are typically added, not multiplied. There used to be places that filtered anything with a 0.0 score however. Are any of those left?

          Show
          Doug Cutting added a comment - > Why 1.0 and not 0.0? 0.0 does seem more appropriate, since scores are typically added, not multiplied. There used to be places that filtered anything with a 0.0 score however. Are any of those left?
          Hide
          Marvin Humphrey added a comment - - edited

          > the query can be executed and returns score 1.0 for all matching documents

          Why 1.0 and not 0.0? I chose to use 0.0 in KS because then a Filter would effectively perform only binary filtering and never affect scores.

          Perhaps your envisioned query optimization algo ensures that the Filter will only serve as a DocIDSetIterator unless it's used as a top-level Query? Can we agree that a Filter used as a sub-clause in a complex query should not contribute to the aggregate score?

          UPDATE: A closer reading reveals that the original issue text contains the answer to this question (we agree), so the only remaining question is whether a top level query would return hits with scores of 0.0 or 1.0. Which is probably a bikeshed painting issue.

          Show
          Marvin Humphrey added a comment - - edited > the query can be executed and returns score 1.0 for all matching documents Why 1.0 and not 0.0? I chose to use 0.0 in KS because then a Filter would effectively perform only binary filtering and never affect scores. Perhaps your envisioned query optimization algo ensures that the Filter will only serve as a DocIDSetIterator unless it's used as a top-level Query? Can we agree that a Filter used as a sub-clause in a complex query should not contribute to the aggregate score? UPDATE: A closer reading reveals that the original issue text contains the answer to this question (we agree), so the only remaining question is whether a top level query would return hits with scores of 0.0 or 1.0. Which is probably a bikeshed painting issue.
          Hide
          Uwe Schindler added a comment -

          Further patches must now remove the deprecated ConstantScoreQuery from the core and contrib classes (RangeQuery gets identical to RangeFilter and so on).

          The rewrite method (inherited from the Filter API) may then be changed in RangeQuery to return a BooleanQuery when useConstantScoreRewrite==false. In this case the RangeFilter gets the same as the Query and when used as Query, can also rewrite to TermQueries. But as I know, this is to be removed in Lucene 3.0. In this case every RangeFilter and the others can simply be constant score or filters.

          Show
          Uwe Schindler added a comment - Further patches must now remove the deprecated ConstantScoreQuery from the core and contrib classes (RangeQuery gets identical to RangeFilter and so on). The rewrite method (inherited from the Filter API) may then be changed in RangeQuery to return a BooleanQuery when useConstantScoreRewrite==false. In this case the RangeFilter gets the same as the Query and when used as Query, can also rewrite to TermQueries. But as I know, this is to be removed in Lucene 3.0. In this case every RangeFilter and the others can simply be constant score or filters.
          Hide
          Uwe Schindler added a comment -

          This is the patch.

          Most tests pass. Problems are only in the tests that check the explanations (TestSimpleExplanations) when wrapping the deprecated ConstantScoreQuery. For that the test must be rewritten to directly get the explanation from the Filter class (that is now a query). ConstantScoreQuery is just a no-op that rewrites to the wrapped Filter itsself but returns no weight and no explanation.

          Some small problems are the handling of toString()/toString(fieldname). The abstract Filter class has no toString(fieldname), but Query has one (but abstract). So there must be a default or Field implementations must be extended to provide one (bc break). The patch uses some bad workaround for that, maybe somebody has a better idea how to handle this.

          Show
          Uwe Schindler added a comment - This is the patch. Most tests pass. Problems are only in the tests that check the explanations (TestSimpleExplanations) when wrapping the deprecated ConstantScoreQuery. For that the test must be rewritten to directly get the explanation from the Filter class (that is now a query). ConstantScoreQuery is just a no-op that rewrites to the wrapped Filter itsself but returns no weight and no explanation. Some small problems are the handling of toString()/toString(fieldname). The abstract Filter class has no toString(fieldname), but Query has one (but abstract). So there must be a default or Field implementations must be extended to provide one (bc break). The patch uses some bad workaround for that, maybe somebody has a better idea how to handle this.

            People

            • Assignee:
              Unassigned
              Reporter:
              Uwe Schindler
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:

                Development