Lucene - Core
  1. Lucene - Core
  2. LUCENE-2590

Enable access to the freq information in a Query's sub-scorers

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The ability to gather more details than just the score, of how a given
      doc matches the current query, has come up a number of times on the
      user's lists. (most recently in the thread "Query Match Count" by
      Ryan McV on java-user).

      EG if you have a simple TermQuery "foo", on each hit you'd like to
      know how many times "foo" occurred in that doc; or a BooleanQuery +foo
      +bar, being able to separately see the freq of foo and bar for the
      current hit.

      Lucene doesn't make this possible today, which is a shame because
      Lucene in fact does compute exactly this information; it's just not
      accessible from the Collector.

      1. LUCENE-2590.patch
        16 kB
        Michael McCandless
      2. LUCENE-2590.patch
        19 kB
        Michael McCandless
      3. LUCENE-2590.patch
        30 kB
        Simon Willnauer
      4. LUCENE-2590.patch
        26 kB
        Simon Willnauer
      5. LUCENE-2590.patch
        32 kB
        Simon Willnauer
      6. LUCENE-2590.patch
        31 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          Attached patch.

          There were two things to fix. First, I added Scorer.freq() so that
          you could get the freq of the current match. Second, I added a way to
          visit all sub-scorers for a given Scorer. The functionality is
          definitely expert level – you'll have to make a custom Collector that
          gathers these stats & saves them away in your PQ, to use this "for
          real". But it's a good start I think; there's a simple test case
          showing how it works.

          Show
          Michael McCandless added a comment - Attached patch. There were two things to fix. First, I added Scorer.freq() so that you could get the freq of the current match. Second, I added a way to visit all sub-scorers for a given Scorer. The functionality is definitely expert level – you'll have to make a custom Collector that gathers these stats & saves them away in your PQ, to use this "for real". But it's a good start I think; there's a simple test case showing how it works.
          Hide
          Michael McCandless added a comment -

          Another iter of this patch; I think it's ready to commit. I commented out BooleanScorer's handling of required=true since we never use it (but left TODOs in there to re-enable this code if/when we go back to sending required clauses to BooleanScorer). I also added CHANGES entry.

          Show
          Michael McCandless added a comment - Another iter of this patch; I think it's ready to commit. I commented out BooleanScorer's handling of required=true since we never use it (but left TODOs in there to re-enable this code if/when we go back to sending required clauses to BooleanScorer). I also added CHANGES entry.
          Hide
          Simon Willnauer added a comment -

          Mike, eventually we are getting this feature - I guess many are waiting for this!

          I wonder if we really need to pass the BooleanClause.Occur to the visit method of if we rather defined a visit method for each boolean relationship. We could make it an abstract class and if you are not interested in Prohibited or Optional scorers you can simply leave the method empty. JIT compiler might be able to optimize this method calls away. if we go this way I'd make VisitSubScorer abstract

          public abstract class VisitSubScorer {
             public void visitRequired(Query parent, Query child, Scorer childScorer){}
             public void visitOptional(Query parent, Query child, Scorer childScorer){}
             public void visitProhibited(Query parent, Query child, Scorer childScorer){}
          }
          

          I we could also make parent and child <P extends Query, C extends Query, S extends Scorer> in certain (I have to admit rather rare) situations this could make users code simpler as they might know how their queries look like and the cast is done implicitly. For all others nothing really changes though. Its just an idea and I thought I'd mention it.

          Show
          Simon Willnauer added a comment - Mike, eventually we are getting this feature - I guess many are waiting for this! I wonder if we really need to pass the BooleanClause.Occur to the visit method of if we rather defined a visit method for each boolean relationship. We could make it an abstract class and if you are not interested in Prohibited or Optional scorers you can simply leave the method empty. JIT compiler might be able to optimize this method calls away. if we go this way I'd make VisitSubScorer abstract public abstract class VisitSubScorer { public void visitRequired(Query parent, Query child, Scorer childScorer){} public void visitOptional(Query parent, Query child, Scorer childScorer){} public void visitProhibited(Query parent, Query child, Scorer childScorer){} } I we could also make parent and child <P extends Query, C extends Query, S extends Scorer> in certain (I have to admit rather rare) situations this could make users code simpler as they might know how their queries look like and the cast is done implicitly. For all others nothing really changes though. Its just an idea and I thought I'd mention it.
          Hide
          Michael McCandless added a comment -

          OK I like the 3 methods instead of the single method with BooleanClause.OCCUR param.

          The heavy generics would be useful for apps that eg have a fixed structure to all Queries right? Eg if you know your app only ever makes BooleanQuery w/ TermQuery sub-classes, you'd parameterize the instantiation as <BooleanQuery,TermQuery,TermScrorer> (hmm though TermScorer isn't public; maybe just leave S as Scorer)? I think this makes sense.

          I'll work on these but it'll be a little while – I'm out on "vacation".

          Show
          Michael McCandless added a comment - OK I like the 3 methods instead of the single method with BooleanClause.OCCUR param. The heavy generics would be useful for apps that eg have a fixed structure to all Queries right? Eg if you know your app only ever makes BooleanQuery w/ TermQuery sub-classes, you'd parameterize the instantiation as <BooleanQuery,TermQuery,TermScrorer> (hmm though TermScorer isn't public; maybe just leave S as Scorer)? I think this makes sense. I'll work on these but it'll be a little while – I'm out on "vacation".
          Hide
          Simon Willnauer added a comment -

          The heavy generics would be useful for apps that eg have a fixed structure to all Queries right? Eg if you know your app only ever makes BooleanQuery w/ TermQuery sub-classes, you'd parameterize the instantiation as <BooleanQuery,TermQuery,TermScrorer> (hmm though TermScorer isn't public; maybe just leave S as Scorer)? I think this makes sense.

          jep, this is exactly what I was aiming at. Scorer might be an exception but in general this it what you would do. A special case where I could think of is FuzzyQuery where you can tell from the rewrite method what kind of query will be constructed and where this information could be useful (think of spellchecking of similar usecases)

          I'll work on these but it'll be a little while - I'm out on "vacation".

          Good to hear, enjoy! I might update you patch accordingly before you come back though!

          simon

          Show
          Simon Willnauer added a comment - The heavy generics would be useful for apps that eg have a fixed structure to all Queries right? Eg if you know your app only ever makes BooleanQuery w/ TermQuery sub-classes, you'd parameterize the instantiation as <BooleanQuery,TermQuery,TermScrorer> (hmm though TermScorer isn't public; maybe just leave S as Scorer)? I think this makes sense. jep, this is exactly what I was aiming at. Scorer might be an exception but in general this it what you would do. A special case where I could think of is FuzzyQuery where you can tell from the rewrite method what kind of query will be constructed and where this information could be useful (think of spellchecking of similar usecases) I'll work on these but it'll be a little while - I'm out on "vacation". Good to hear, enjoy! I might update you patch accordingly before you come back though! simon
          Hide
          Michael McCandless added a comment -

          Good to hear, enjoy! I might update you patch accordingly before you come back though!

          Please do!! Feel free

          Show
          Michael McCandless added a comment - Good to hear, enjoy! I might update you patch accordingly before you come back though! Please do!! Feel free
          Hide
          Simon Willnauer added a comment -

          Attaching a slightly new version of the patch incorporating roughly the idea of 3-distinct methods. Yet, having 3 methods instead of one was kind of ugly to implement and I didn't want to add switch case statements to visitSubScorer methods just to call a different method.

          Anyway, I changed the 3 parameter methods to take a single argument holding all required information. This is way more extensible and flexible since we don't need to change a method signature just to add a new param.
          I am not sure about the 3 different callbacks but I thought I would leave it in the patch for now.

          Parent, Child and Scorer became generic arguments.

          simon

          Show
          Simon Willnauer added a comment - Attaching a slightly new version of the patch incorporating roughly the idea of 3-distinct methods. Yet, having 3 methods instead of one was kind of ugly to implement and I didn't want to add switch case statements to visitSubScorer methods just to call a different method. Anyway, I changed the 3 parameter methods to take a single argument holding all required information. This is way more extensible and flexible since we don't need to change a method signature just to add a new param. I am not sure about the 3 different callbacks but I thought I would leave it in the patch for now. Parent, Child and Scorer became generic arguments. simon
          Hide
          Michael McCandless added a comment -

          I think the API changes are too complex/heavyweight here!

          With this new patch we've added Scorer.ScorerVisitor,
          Scorer.SubScorerCallback, Scorer.ScorerContext, and two new
          Scorer.accept methods.

          The original patch added only a 1-method interface
          "VisitSubScorer".

          Can't we simplify this? I liked your original proposal, breaking out
          explicit visitRequired, visitOptional, etc., but keeping this as a
          single class. Or we can go back to the original patch (just passing
          an arg expressing the relationship)?

          I also don't like the "context" approach, setting attrs on a shared
          instance. This is basically setting up arguments to pass to the
          callback – why not simply pass these arguments (on the stack)
          instead?

          I don't like the "accept" name – it's very generic – can we put this
          back to visitSubScorers or something that makes it clear you're
          visiting the full sub-tree (visitScorers? visitScorerTree?)?

          Show
          Michael McCandless added a comment - I think the API changes are too complex/heavyweight here! With this new patch we've added Scorer.ScorerVisitor, Scorer.SubScorerCallback, Scorer.ScorerContext, and two new Scorer.accept methods. The original patch added only a 1-method interface "VisitSubScorer". Can't we simplify this? I liked your original proposal, breaking out explicit visitRequired, visitOptional, etc., but keeping this as a single class. Or we can go back to the original patch (just passing an arg expressing the relationship)? I also don't like the "context" approach, setting attrs on a shared instance. This is basically setting up arguments to pass to the callback – why not simply pass these arguments (on the stack) instead? I don't like the "accept" name – it's very generic – can we put this back to visitSubScorers or something that makes it clear you're visiting the full sub-tree (visitScorers? visitScorerTree?)?
          Hide
          Simon Willnauer added a comment -

          Can't we simplify this? I liked your original proposal, breaking out
          explicit visitRequired, visitOptional, etc., but keeping this as a
          single class. Or we can go back to the original patch (just passing
          an arg expressing the relationship)?

          The problem leading to this quiet heavy change was that I wanted to call a specific function for each Boolean.Occur relationship. That turned out to be ugly since I had to either add a switch / case statement to each of the visitSubScorers methods or add a visitSubScorers method for each relationship. I didn't like that at all so I moved forward to specify a callback for each relationship. I see your point of being to heavy though. I guess we should go back to the original approach since I don't want to decide it on a switch case basis.

          also don't like the "context" approach, setting attrs on a shared
          instance. This is basically setting up arguments to pass to the
          callback - why not simply pass these arguments (on the stack)
          instead?

          Passing it on the stack has one limitation which is that if I want to pass more information around in the future I need to change the interface while I only add a member to the "context" if I pass it that way. Another reason was that if I want to pass custom info from a custom query scorer to another I can not do that since there is not context.

          One other solution would be to reduce the interface back to:

          public abstract class ScorerVisitor<P extends Query, C extends Query, S extends Scorer>{
             public Occur relationship;
             public abstract void visit(P parent, C child, S childScorer);
          }
          

          but hold the relationship as an attribute that way we don't have it in the method signature which I don't like though.

          I don't like the "accept" name - it's very generic - can we put this
          back to visitSubScorers or something that makes it clear you're
          visiting the full sub-tree (visitScorers? visitScorerTree?)?

          I'm ok with visitScorers

          Show
          Simon Willnauer added a comment - Can't we simplify this? I liked your original proposal, breaking out explicit visitRequired, visitOptional, etc., but keeping this as a single class. Or we can go back to the original patch (just passing an arg expressing the relationship)? The problem leading to this quiet heavy change was that I wanted to call a specific function for each Boolean.Occur relationship. That turned out to be ugly since I had to either add a switch / case statement to each of the visitSubScorers methods or add a visitSubScorers method for each relationship. I didn't like that at all so I moved forward to specify a callback for each relationship. I see your point of being to heavy though. I guess we should go back to the original approach since I don't want to decide it on a switch case basis. also don't like the "context" approach, setting attrs on a shared instance. This is basically setting up arguments to pass to the callback - why not simply pass these arguments (on the stack) instead? Passing it on the stack has one limitation which is that if I want to pass more information around in the future I need to change the interface while I only add a member to the "context" if I pass it that way. Another reason was that if I want to pass custom info from a custom query scorer to another I can not do that since there is not context. One other solution would be to reduce the interface back to: public abstract class ScorerVisitor<P extends Query, C extends Query, S extends Scorer>{ public Occur relationship; public abstract void visit(P parent, C child, S childScorer); } but hold the relationship as an attribute that way we don't have it in the method signature which I don't like though. I don't like the "accept" name - it's very generic - can we put this back to visitSubScorers or something that makes it clear you're visiting the full sub-tree (visitScorers? visitScorerTree?)? I'm ok with visitScorers
          Hide
          Simon Willnauer added a comment -

          another iteration moving back to the originally discussed approach. The switch case block shouldn't be any bottleneck since the visitor should not run for each scored doc anyway.

          Show
          Simon Willnauer added a comment - another iteration moving back to the originally discussed approach. The switch case block shouldn't be any bottleneck since the visitor should not run for each scored doc anyway.
          Hide
          Michael McCandless added a comment -

          Patch looks good! Thanks, Simon.

          Can we factor out that switch statement into super's visitSubScorers? This way the base (Scorer) impl would be responsible for visiting "this", and then the subclass would override only if it had subs (ie BS, BS2)?

          Also, we are missing some scorers (SpanScorer, ConstantScoreQuery.ConstantScorer, probably others), but if we do the super approach, we'd get these "for free" (I think?).

          Oh I see we can't quite have Scorer impl this because it doesn't know the query. But maybe we can factor out a common method, that the subclass passed the query to?

          Show
          Michael McCandless added a comment - Patch looks good! Thanks, Simon. Can we factor out that switch statement into super's visitSubScorers? This way the base (Scorer) impl would be responsible for visiting "this", and then the subclass would override only if it had subs (ie BS, BS2)? Also, we are missing some scorers (SpanScorer, ConstantScoreQuery.ConstantScorer, probably others), but if we do the super approach, we'd get these "for free" (I think?). Oh I see we can't quite have Scorer impl this because it doesn't know the query. But maybe we can factor out a common method, that the subclass passed the query to?
          Hide
          Simon Willnauer added a comment -

          Oh I see we can't quite have Scorer impl this because it doesn't know the query. But maybe we can factor out a common method, that the subclass passed the query to?

          I had the same idea in a previous iteration but since Scorer doesn't know about the Query the scorer concerns I can not do the call. One way of doing it would be adding the scorers Weight as a protected final member since Weight already has a #getQuery() method we can easily access it or throw an UnsupportedOperationException if the weight is null (force it via ctor and have a default one which sets it to null).

          Since the most of the scorers know their Weight anyway and would need to call the visitor we can also factor it out.

          Also, we are missing some scorers (SpanScorer, ConstantScoreQuery.ConstantScorer, probably others), but if we do the super approach, we'd get these "for free" (I think?).

          most of them would then be for free though!

          Thoughts?

          Show
          Simon Willnauer added a comment - Oh I see we can't quite have Scorer impl this because it doesn't know the query. But maybe we can factor out a common method, that the subclass passed the query to? I had the same idea in a previous iteration but since Scorer doesn't know about the Query the scorer concerns I can not do the call. One way of doing it would be adding the scorers Weight as a protected final member since Weight already has a #getQuery() method we can easily access it or throw an UnsupportedOperationException if the weight is null (force it via ctor and have a default one which sets it to null). Since the most of the scorers know their Weight anyway and would need to call the visitor we can also factor it out. Also, we are missing some scorers (SpanScorer, ConstantScoreQuery.ConstantScorer, probably others), but if we do the super approach, we'd get these "for free" (I think?). most of them would then be for free though! Thoughts?
          Hide
          Michael McCandless added a comment -

          One way of doing it would be adding the scorers Weight as a protected final member since Weight already has a #getQuery() method we can easily access it or throw an UnsupportedOperationException if the weight is null (force it via ctor and have a default one which sets it to null).

          +1

          This would then make a default base impl work well for all leaf queries.

          Show
          Michael McCandless added a comment - One way of doing it would be adding the scorers Weight as a protected final member since Weight already has a #getQuery() method we can easily access it or throw an UnsupportedOperationException if the weight is null (force it via ctor and have a default one which sets it to null). +1 This would then make a default base impl work well for all leaf queries.
          Hide
          Simon Willnauer added a comment -

          Another iteration with Weight as a protected member or Scorer. All scorers I looked at had the weight already as a member so this change makes things way simpler though. I think this is close to commit.

          Show
          Simon Willnauer added a comment - Another iteration with Weight as a protected member or Scorer. All scorers I looked at had the weight already as a member so this change makes things way simpler though. I think this is close to commit.
          Hide
          Michael McCandless added a comment -

          Looks great!

          Though, can't BS and BS2 just call super.visitSubScorers first, and then visit their subs? (Ie right now they dup super's code right?).

          Show
          Michael McCandless added a comment - Looks great! Though, can't BS and BS2 just call super.visitSubScorers first, and then visit their subs? (Ie right now they dup super's code right?).
          Hide
          Simon Willnauer added a comment -

          Though, can't BS and BS2 just call super.visitSubScorers first, and then visit their subs? (Ie right now they dup super's code right?).

          Nah, good point mike I missed that, nice code reuse though! - will fix that soon.

          Show
          Simon Willnauer added a comment - Though, can't BS and BS2 just call super.visitSubScorers first, and then visit their subs? (Ie right now they dup super's code right?). Nah, good point mike I missed that, nice code reuse though! - will fix that soon.
          Hide
          Simon Willnauer added a comment -

          new patch - fixed the code dup

          Show
          Simon Willnauer added a comment - new patch - fixed the code dup
          Hide
          Michael McCandless added a comment -

          Patch looks great Simon, thanks! +1 to commit.

          Show
          Michael McCandless added a comment - Patch looks great Simon, thanks! +1 to commit.
          Hide
          Simon Willnauer added a comment -

          Committed revision 991310

          Thanks Mike

          Show
          Simon Willnauer added a comment - Committed revision 991310 Thanks Mike
          Hide
          Michael McCandless added a comment -

          Thank you!

          I think we should also backport to 3.x?

          Show
          Michael McCandless added a comment - Thank you! I think we should also backport to 3.x?
          Hide
          Simon Willnauer added a comment -

          I think we should also backport to 3.x?

          Yeah we should I guess - I will look into this tomorrow.

          Show
          Simon Willnauer added a comment - I think we should also backport to 3.x? Yeah we should I guess - I will look into this tomorrow.
          Hide
          Michael McCandless added a comment -

          OK sounds good...

          Show
          Michael McCandless added a comment - OK sounds good...
          Hide
          Simon Willnauer added a comment -

          Ported back to 3.x

          Committed revision 991537.

          Show
          Simon Willnauer added a comment - Ported back to 3.x Committed revision 991537.
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development