Lucene - Core
  1. Lucene - Core
  2. LUCENE-2868

It should be easy to make use of TermState; rewritten queries should be shared automatically

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically.

      This can be most readily (and powerfully) done by introducing a new method to Query.java:

      Query rewriteUsingCache(IndexReader indexReader)

      ... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite().

      1. query-rewriter.patch
        37 kB
        Simon Willnauer
      2. lucene-2868.patch
        64 kB
        Karl Wright
      3. lucene-2868.patch
        78 kB
        Karl Wright
      4. LUCENE-2868.patch
        35 kB
        Karl Wright
      5. LUCENE-2868.patch
        37 kB
        Karl Wright
      6. LUCENE-2868.patch
        35 kB
        Karl Wright

        Activity

        Hide
        Karl Wright added a comment -

        Updated for latest trunk

        Show
        Karl Wright added a comment - Updated for latest trunk
        Hide
        Simon Willnauer added a comment -

        LUCENE-3041 has been opened as a follow-up on this issue.

        Show
        Simon Willnauer added a comment - LUCENE-3041 has been opened as a follow-up on this issue.
        Hide
        Chris Male added a comment -

        I've opened LUCENE-3041 to work on the suggestions made by Earwin.

        Show
        Chris Male added a comment - I've opened LUCENE-3041 to work on the suggestions made by Earwin.
        Hide
        Karl Wright added a comment -

        Missed a file in the patch

        Show
        Karl Wright added a comment - Missed a file in the patch
        Hide
        Karl Wright added a comment -

        Simon's patch, updated to latest trunk

        Show
        Karl Wright added a comment - Simon's patch, updated to latest trunk
        Hide
        Karl Wright added a comment -

        Oops, forgot to add a key file.
        Seriously, the weight caching is of minor utility. The scorer caching is not enabled. So all that this patch does differently is try to define a broader concept of query context, rather than the narrow fix Simon proposes.

        Show
        Karl Wright added a comment - Oops, forgot to add a key file. Seriously, the weight caching is of minor utility. The scorer caching is not enabled. So all that this patch does differently is try to define a broader concept of query context, rather than the narrow fix Simon proposes.
        Hide
        Simon Willnauer added a comment -

        Here's my take on the patch, including ability to cache weight objects.

        I have a couple of comments here - first I can not apply your patch to the current trunk can you update it?

        • you keep a cache per IndexSeacher (btw. QueryDataCache is missing in the patch) which is used to cache several things across searches. This is very dangerous! While I don't know how it is implemented I would guess you need to synchronized access to it so it would slow down searches ey?
        • Caching Scorers is going to break since Scorers are stateful and might be advanced to different documents. Yet, I can see what you are trying to do here since doing work in a scorer is costly so common TermQueries for instance should not need to load the same posting list twice. There are two things which come to my mind right away. 1. Postinglist caching - should be done on a codec level IMO 2. Building PerReaderTermState only once for a common TermQuery. While caching PostingLists is going to be tricky and quite a task reusing PerReaderTermState could work fine as far as I can see if you are in the same searcher.
        • Caching Weights is kind of weird - what is the reason for this again? The only thing you really save here is setup costs which are generally very low.

        Overall I don' t like that this way you tightly couple something to Weight / Query etc. for a single purpose what could be solved with some kind of query optimization phase similar to what I had in my last patch and Earwin has proposed. I think we should not tight couple things like that into lucene. This is really extremely application dependent in the most cases and we should only provide the infrastructure to do it.

        Earwin - I think we should make a new issue and get something like that implemented in there which is more general than what I just sketched out. If you could share your code that would be awesome!

        Earwin, any new on this - shall I open an issue for that?

        It occurs to me that the name of the common class that gets created in IndexSearcher and passed around should probably be named something more appropriate, like QueryContext. That way people will feel free to extend it to hold all sorts of query-local data, in time. Thoughts?

        You refer to ScorerContext? This class was actually not intended to be expendable its public final until now. I am not sure if we should open that up though.

        Show
        Simon Willnauer added a comment - Here's my take on the patch, including ability to cache weight objects. I have a couple of comments here - first I can not apply your patch to the current trunk can you update it? you keep a cache per IndexSeacher (btw. QueryDataCache is missing in the patch) which is used to cache several things across searches. This is very dangerous! While I don't know how it is implemented I would guess you need to synchronized access to it so it would slow down searches ey? Caching Scorers is going to break since Scorers are stateful and might be advanced to different documents. Yet, I can see what you are trying to do here since doing work in a scorer is costly so common TermQueries for instance should not need to load the same posting list twice. There are two things which come to my mind right away. 1. Postinglist caching - should be done on a codec level IMO 2. Building PerReaderTermState only once for a common TermQuery. While caching PostingLists is going to be tricky and quite a task reusing PerReaderTermState could work fine as far as I can see if you are in the same searcher. Caching Weights is kind of weird - what is the reason for this again? The only thing you really save here is setup costs which are generally very low. Overall I don' t like that this way you tightly couple something to Weight / Query etc. for a single purpose what could be solved with some kind of query optimization phase similar to what I had in my last patch and Earwin has proposed. I think we should not tight couple things like that into lucene. This is really extremely application dependent in the most cases and we should only provide the infrastructure to do it. Earwin - I think we should make a new issue and get something like that implemented in there which is more general than what I just sketched out. If you could share your code that would be awesome! Earwin, any new on this - shall I open an issue for that? It occurs to me that the name of the common class that gets created in IndexSearcher and passed around should probably be named something more appropriate, like QueryContext. That way people will feel free to extend it to hold all sorts of query-local data, in time. Thoughts? You refer to ScorerContext? This class was actually not intended to be expendable its public final until now. I am not sure if we should open that up though.
        Hide
        Karl Wright added a comment -

        It occurs to me that the name of the common class that gets created in IndexSearcher and passed around should probably be named something more appropriate, like QueryContext. That way people will feel free to extend it to hold all sorts of query-local data, in time. Thoughts?

        Show
        Karl Wright added a comment - It occurs to me that the name of the common class that gets created in IndexSearcher and passed around should probably be named something more appropriate, like QueryContext. That way people will feel free to extend it to hold all sorts of query-local data, in time. Thoughts?
        Hide
        Karl Wright added a comment -

        Here's my take on the patch, including ability to cache weight objects.

        Show
        Karl Wright added a comment - Here's my take on the patch, including ability to cache weight objects.
        Hide
        Simon Willnauer added a comment -

        I can share a generic reflection-based visitor that's somewhat more handy than default visitor pattern in java.

        Earwin - I think we should make a new issue and get something like that implemented in there which is more general than what I just sketched out. If you could share your code that would be awesome!

        Show
        Simon Willnauer added a comment - I can share a generic reflection-based visitor that's somewhat more handy than default visitor pattern in java. Earwin - I think we should make a new issue and get something like that implemented in there which is more general than what I just sketched out. If you could share your code that would be awesome!
        Hide
        Earwin Burrfoot added a comment -

        We here use an intermediate query AST, with a number of walkers that do synonym substitution, optimization, caching, rewriting for multiple fields, and finally - generating a tree of Lucene Queries.

        I can share a generic reflection-based visitor that's somewhat more handy than default visitor pattern in java.
        Usage looks roughly like:

        class ToStringWalker extends DispatchingVisitor<String> { // String here stands for the type of walk result
          String visit(TermQuery q) {
            return "{term: " + q.getTerm() + "}";
          }
        
          String visit(BooleanQuery q) {
            StringBuffer buf = new StringBuffer();
            buf.append("{boolean: ");
            for (BooleanQuery.Clause clause: q.clauses()) {
              buf.append(dispatch(clause.getQuery()).append(", "); // Here we 
            }
            buf.append("}");
            return buf.toString();
          }
        
          String visit(SpanQuery q) { // Runs for all SpanQueries
            .....
          }
        
          String visit(Query q) { // Runs for all Queries not covered by a more exact visit() method 
            ......
          }
        }
        
        Query query = ...;
        String stringRepresentation = new ToStringWalker().dispatch(query);
        

        dispatch() checks its parameter runtime type, picks a visit()'s most close overload (according to java rules for compile-time overloaded method resolution), and invokes it.

        Show
        Earwin Burrfoot added a comment - We here use an intermediate query AST, with a number of walkers that do synonym substitution, optimization, caching, rewriting for multiple fields, and finally - generating a tree of Lucene Queries. I can share a generic reflection-based visitor that's somewhat more handy than default visitor pattern in java. Usage looks roughly like: class ToStringWalker extends DispatchingVisitor< String > { // String here stands for the type of walk result String visit(TermQuery q) { return "{term: " + q.getTerm() + "}" ; } String visit(BooleanQuery q) { StringBuffer buf = new StringBuffer (); buf.append( "{ boolean : " ); for (BooleanQuery.Clause clause: q.clauses()) { buf.append(dispatch(clause.getQuery()).append( ", " ); // Here we } buf.append( "}" ); return buf.toString(); } String visit(SpanQuery q) { // Runs for all SpanQueries ..... } String visit(Query q) { // Runs for all Queries not covered by a more exact visit() method ...... } } Query query = ...; String stringRepresentation = new ToStringWalker().dispatch(query); dispatch() checks its parameter runtime type, picks a visit()'s most close overload (according to java rules for compile-time overloaded method resolution), and invokes it.
        Hide
        Simon Willnauer added a comment -

        I just sketched out what I have in mind could solve this problem and create the infrastructure to do way more than just caching a query#rewrite.
        This patch (which is just a sketch to show what I have in mind) adds a QueryRewriter class that walks the Query "AST" and rewrites each query "node" in the tree. The default implementation does nothing special, it just forwards to the queryies rewrite method but there seems to be a whole lot of potential in such a tree-walker / visitor. For instance could we subclass it to optimize certain queries if we fix the coord problem. Yet another usecase is to decouple MTQ rewriter entirely from MTQ (not sure if we want that though) or somebody wants to wrap a query during rewrite.

        Even further somebody could rewrite against fieldcache? Maybe this can be even more general and just be a QueryVisitor so folks can easily walk the tree.

        I think this is really something that should be solved in general AND in a different issue.

        simon

        Show
        Simon Willnauer added a comment - I just sketched out what I have in mind could solve this problem and create the infrastructure to do way more than just caching a query#rewrite. This patch (which is just a sketch to show what I have in mind) adds a QueryRewriter class that walks the Query "AST" and rewrites each query "node" in the tree. The default implementation does nothing special, it just forwards to the queryies rewrite method but there seems to be a whole lot of potential in such a tree-walker / visitor. For instance could we subclass it to optimize certain queries if we fix the coord problem. Yet another usecase is to decouple MTQ rewriter entirely from MTQ (not sure if we want that though) or somebody wants to wrap a query during rewrite. Even further somebody could rewrite against fieldcache? Maybe this can be even more general and just be a QueryVisitor so folks can easily walk the tree. I think this is really something that should be solved in general AND in a different issue. simon
        Hide
        Karl Wright added a comment -

        I reworded the description.

        I think the word "cache" is correct, but what we really need is simply a cache that has the lifetime of a top-level rewrite. I agree that putting the data in the query object itself would not have this characteristic, but on the other hand a second Query method that is cache aware seems reasonable. For example:

        Query rewriteMinimal(RewriteCache rc, IndexReader ir)

        ... where RewriteCache was an object that had a lifetime consistent with the highest-level rewrite operation done on the query graph. The rewriteMinimal() method would look for the rewrite of the the current query in the RewriteCache, and if found, would return that, otherwise would call plain old rewrite() and then save the result.

        So the patch would include:
        (a) the change as specified to Query.java
        (b) an implementation of RewriteCache, which could just be simplified to Map<Query,Query>
        (c) changes to the callers of rewrite(), so that the minimal rewrite was called instead.

        Thoughts?

        Show
        Karl Wright added a comment - I reworded the description. I think the word "cache" is correct, but what we really need is simply a cache that has the lifetime of a top-level rewrite. I agree that putting the data in the query object itself would not have this characteristic, but on the other hand a second Query method that is cache aware seems reasonable. For example: Query rewriteMinimal(RewriteCache rc, IndexReader ir) ... where RewriteCache was an object that had a lifetime consistent with the highest-level rewrite operation done on the query graph. The rewriteMinimal() method would look for the rewrite of the the current query in the RewriteCache, and if found, would return that, otherwise would call plain old rewrite() and then save the result. So the patch would include: (a) the change as specified to Query.java (b) an implementation of RewriteCache, which could just be simplified to Map<Query,Query> (c) changes to the callers of rewrite(), so that the minimal rewrite was called instead. Thoughts?
        Hide
        Simon Willnauer added a comment -

        Actually, I think we need to clarify the description of this issue. This has nothing todo with TermCache at all. It actually reads very scary though since caches are really tricky and this one is mainly about rewrite cost in MTQ. This said, adding a method to Query just for the sake of MTQ rewrite seems kind of odd though. We should rather optimize the query structure somehow instead of caching a rewrite method.

        Show
        Simon Willnauer added a comment - Actually, I think we need to clarify the description of this issue. This has nothing todo with TermCache at all. It actually reads very scary though since caches are really tricky and this one is mainly about rewrite cost in MTQ. This said, adding a method to Query just for the sake of MTQ rewrite seems kind of odd though. We should rather optimize the query structure somehow instead of caching a rewrite method.
        Hide
        Simon Willnauer added a comment -

        Who would create the RewriteCache object? The IndexSearcher?

        it could.. or just be an overloaded IS.search method

        Show
        Simon Willnauer added a comment - Who would create the RewriteCache object? The IndexSearcher? it could.. or just be an overloaded IS.search method
        Hide
        Karl Wright added a comment -

        Fine by me if you have a better way of doing it!

        Who would create the RewriteCache object? The IndexSearcher?

        Show
        Karl Wright added a comment - Fine by me if you have a better way of doing it! Who would create the RewriteCache object? The IndexSearcher?
        Hide
        Simon Willnauer added a comment -

        When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermCache addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically.

        First of all, I get nervous when it gets to stuff like this! Hence, I can see when this could be useful, for instance if you have one and the same FuzzyQuery / RegexpQuery which has a rather large setup cost in more than one clause in a boolean query then this would absolutely help. For other queryies like TermQuery the TermState cache in TermsEnum already helps you a lot so for those this wouldn't make a big difference though.

        Query rewriteUsingCache(IndexReader indexReader)

        I think one major issue here is how would you clear a cache here. WeakReferences would work but I would't to put any cache into any query. In general we shouldn't make any query heavy weight or "somewhate stateful" at all. Yet, if we would pass a RewriteCache into Query#rewrite(IR, RC) that has a per IS#search lifetime this could actually work. This would also be easy to implement Query#rewrite(IR, RC) would just forward to Query#rewrite(IR) for by default and combining (BooleanQuery) queries could override the new one. Eventually, MultiTermQuery can provide such an impl and check the cache if it needs to rewrite itself or return an already rewritten version.

        Show
        Simon Willnauer added a comment - When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermCache addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. First of all, I get nervous when it gets to stuff like this! Hence, I can see when this could be useful, for instance if you have one and the same FuzzyQuery / RegexpQuery which has a rather large setup cost in more than one clause in a boolean query then this would absolutely help. For other queryies like TermQuery the TermState cache in TermsEnum already helps you a lot so for those this wouldn't make a big difference though. Query rewriteUsingCache(IndexReader indexReader) I think one major issue here is how would you clear a cache here. WeakReferences would work but I would't to put any cache into any query. In general we shouldn't make any query heavy weight or "somewhate stateful" at all. Yet, if we would pass a RewriteCache into Query#rewrite(IR, RC) that has a per IS#search lifetime this could actually work. This would also be easy to implement Query#rewrite(IR, RC) would just forward to Query#rewrite(IR) for by default and combining (BooleanQuery) queries could override the new one. Eventually, MultiTermQuery can provide such an impl and check the cache if it needs to rewrite itself or return an already rewritten version.

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Karl Wright
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development