Lucene - Core / LUCENE-850

Easily create queries that transform subquery scores arbitrarily

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      Refactor DisMaxQuery into SubQuery(Query|Scorer) that admits easy subclassing. An example is given for multiplicatively combining scores.

      Note: patch is not clean; for demonstration purposes only.

      1. prodscorer.patch.diff
        39 kB
        Mike Klaas
      2. CustomBoostQuery.java
        12 kB
        Mike Klaas

          Activity

          Mike Klaas added a comment -

          Generify the subquery handling logic of DisMax to make it easy to build subquery scorers.

          This patch is demonstrative only. There are no tests, and I'm pretty sure the query norm calculation isn't correct in general.

          Doron Cohen added a comment -

          The ability to transform doc scores obtained by a query is now part of LUCENE-446

          I think that to a certain extent, the patch in this issue went farther than that of LUCENE-446. Here it seems that
          scores of any set of queries can be combined. But in 446, the score transformation is applied on 2 or 3 scores:

          1. score of a single sub-query (any query).
          2. docid
          3. score of a single, optional, sub-field-score-query.

          The last of these is optional; it is the one that assigns a score equal to the value of an indexed field.

          For this reason I hesitated to mark this issue as a duplicate of LUCENE-446.

          But I did not want to basically re-implement BooleanQuery for a multi-queries score transformation.
          And, for the use cases that I can think of, the 3-way approach in LUCENE-446 is sufficiently flexible.

          Thoughts?

          Doron Cohen added a comment -

          Mike,

          If I understood it correctly your patch can be described as:

          • turns DisMaxQuery into a special case of a new generalized "CustomizableOrQuery".
          • demonstrates this customizability with a new ProductQuery.
          • DisMax(OR)Query logic is as before: score = max of sub-scores plus tie breaker.
          • Product(OR)Query logic is: score = multiplication of scores of sub-scorers.

          The regular Boolean OR could probably be phrased this way as Sum(OR)Query.
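The decoupling summarized above can be sketched without any Lucene classes. This is only an illustration of the idea that the combining function is pluggable while the sub-score collection stays the same; `SubScoreCombiner` and `SubScoreCombiners` are hypothetical names, not the classes in the attached patch, and the input array is assumed to hold the scores of the sub-scorers that matched the current document.

```java
// Hypothetical sketch: names are illustrative, not taken from the patch or from Lucene.
interface SubScoreCombiner {
    /** Combines the scores of the sub-scorers that matched the current document. */
    float combine(float[] subScores);
}

final class SubScoreCombiners {
    private SubScoreCombiners() {}

    /** Sum(OR): the additive combination. */
    static final SubScoreCombiner SUM = scores -> {
        float sum = 0f;
        for (float s : scores) {
            sum += s;
        }
        return sum;
    };

    /** Product(OR): multiplication of the sub-scores. */
    static final SubScoreCombiner PRODUCT = scores -> {
        float product = 1f;
        for (float s : scores) {
            product *= s;
        }
        return product;
    };

    /** DisMax(OR): max of the sub-scores plus tieBreaker times the remaining scores. */
    static SubScoreCombiner disMax(float tieBreaker) {
        return scores -> {
            if (scores.length == 0) {
                return 0f; // assumption: no matching sub-scorer means a zero score
            }
            float max = scores[0];
            float sum = 0f;
            for (float s : scores) {
                sum += s;
                max = Math.max(max, s);
            }
            return max + tieBreaker * (sum - max);
        };
    }
}
```

A query built this way would pass one of these combiners to a shared scorer instead of hard-coding the arithmetic; query normalization and weights, which the thread flags as unresolved, are not modeled here.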

          Now in LUCENE-446 I added CustomScoreQuery, which is simpler:

          • score = f (score(q), score(vq))
            where
          • f() is overridable,
          • q is any query
          • vq is optional, and it is a value-source-query, likely based on (cached) field values.
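The two-score shape described above, with an overridable f() and an optional vq, might look like the following stand-alone sketch. The class names are hypothetical and this is not the CustomScoreQuery source; the three-argument `customScore` mirrors the call that appears in the diff later in this thread, and treating an absent vq as a neutral factor of 1 is taken from the same diff.

```java
// Hypothetical sketch of score = f(score(q), score(vq)); not the Lucene implementation.
class TwoScoreFunction {
    /** f(): override to change how the two scores are combined. Default is the product. */
    float customScore(int doc, float subQueryScore, float valSrcScore) {
        return subQueryScore * valSrcScore;
    }

    /** vq is optional: a null value-source score is treated as the neutral factor 1. */
    final float score(int doc, float subQueryScore, Float valSrcScore) {
        float vq = (valSrcScore == null) ? 1f : valSrcScore;
        return customScore(doc, subQueryScore, vq);
    }
}

/** Example override: dampen the value-source score with a square root. */
class SqrtBoostFunction extends TwoScoreFunction {
    @Override
    float customScore(int doc, float subQueryScore, float valSrcScore) {
        return subQueryScore * (float) Math.sqrt(valSrcScore);
    }
}
```

For example, `new SqrtBoostFunction().score(7, 2f, 4f)` multiplies the sub-query score 2 by sqrt(4), while passing `null` for the last argument leaves the sub-query score unchanged.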

          So it currently doesn't support your comment
          "I've often wanted to multiply the scores of two queries".

          When first writing CustomScoreQuery I looked at combining any two or N subqueries, but wasn't sure how to do this. How to normalize. How to calculate the weights. But now I think that we could perhaps follow your approach closer: call it CustomOrQuery, go for any N subqueries, and define f() accordingly.

          But is this really required / useful?
          What are the use cases for this general/arbitrary combining of scores (beyond current capabilities of o.a.l.search.function)?

          Thanks,
          Doron

          Tim Sturge added a comment -

          I just asked for a product scored BooleanQuery on java-users and Mike pointed me in the direction of this bug. My use case is to get the non-phrase query "John Bush" to rank "John Bush" higher than "George Bush" or "John Kerry". I believe this is a common use case (I have 3 or 4 bugs filed against search quality internally that boil down to this issue.)

          Mike Klaas added a comment -

          Hi Doron,

          The main use case is the same as for document (and, to a lesser extent, field) boosts: the ability to weight a document by a certain factor (rather than applying an additive boost, as adding an additional subclause to the query would entail).

          The function query capability works for many situations, as you can store the various types of boosts in a FieldCache and use your approach. But this doesn't scale when there are tons of possible boost fields (which would usually be sparsely-populated). SparseFieldCache, anyone?

          I decided to move away from ProductQueries for the time being, so that is no longer the main use case of this patch. Primarily the patch stems from developer frustration with implementing something like ProductQuery. ISTM that the subquery-handling logic (present in BooleanQuery and slightly different in DisMaxQuery) needn't be so tightly coupled with a choice of scoring function.

          For the record, DisMax is actually a ( x*Max + (1-x)*Sum ) Query, so it is both Sum and Max. Perhaps if we add Prod to the options, there are no more useful subquery combinators?
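As a worked check of the claim above, assume the usual tie-breaker formulation of DisMax: the maximum sub-score plus a tie-breaker t times the sum of the other sub-scores. The symbols s_i (sub-scores) and t are introduced here for illustration; x is the weight from the comment.

```latex
\begin{aligned}
\mathrm{score} &= \max_i s_i + t\Bigl(\sum_i s_i - \max_i s_i\Bigr) \\
               &= (1 - t)\,\max_i s_i + t \sum_i s_i \\
               &= x \cdot \mathrm{Max} + (1 - x) \cdot \mathrm{Sum}, \qquad x = 1 - t .
\end{aligned}
```

Under that assumption, t = 0 gives a pure Max query and t = 1 gives a pure Sum query.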

          Mike Klaas added a comment -

          Tim: That is typically done by adding an optional implicit phrase query:

          john bush -> +(john bush) "john bush"~1000

          This works very well for two-term queries, but less well when there are more terms than that. See also DisjunctionMaxQuery if there are multiple fields.
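The rewrite shown above can be sketched as a small string transformation. `ImplicitPhraseBoost` and `rewrite` are hypothetical names; the sketch assumes the input is plain whitespace-separated terms and does no escaping of query-syntax characters, so it is not a substitute for building the query objects directly.

```java
// Hypothetical helper: turns "john bush" into +(john bush) "john bush"~1000.
final class ImplicitPhraseBoost {
    private ImplicitPhraseBoost() {}

    /**
     * Requires the original terms and adds an optional sloppy phrase over the same terms,
     * so documents containing the terms close together score higher.
     * Assumption: userQuery holds plain terms with no query-syntax characters.
     */
    static String rewrite(String userQuery, int slop) {
        String terms = userQuery.trim().replaceAll("\\s+", " ");
        if (terms.isEmpty() || !terms.contains(" ")) {
            return terms; // zero or one term: a phrase clause adds nothing
        }
        return "+(" + terms + ") \"" + terms + "\"~" + slop;
    }
}
```

The resulting string is intended for a query parser that understands this syntax; the slop of 1000 in the comment simply makes the phrase clause match whenever all terms occur in the field, with closer occurrences scoring higher.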

          Doron Cohen added a comment -

          > The function query capability works for many situations, as you
          > can store the various types of boosts in a FieldCache and use
          > your approach. But this doesn't scale when there are tons of
          > possible boost fields (which would usually be sparsely-populated).
          > SparseFieldCache, anyone?

          For large collections, loading would indeed take a long time.
          Quoting Michael, payloads will be more efficient for this case. Two options actually:

          • faster reading values into a cache
          • value-source that feeds on the fly from payloads.
          Mike Klaas added a comment -

          Here's an approach I think will work.

          Rename CustomScoreQuery to CustomBoostQuery, and remove the ValueSource-specific logic. Really there is no reason to limit the logic to ValueSource queries: the only important criterion is that we don't expect docs that match only the boosting query to be returned (the doc set is unchanged relative to the original query).

          I'm not sure what will happen if the boost query doesn't match the document being boosted, however. Perhaps there should be a default value?

          Does this still belong in the function package?

          Mike Klaas added a comment -

          To address the issue above, the following needs to be added:
          ===================================================================
          --- build-src/java/solr/org/apache/lucene/search/CustomBoostQuery.java (revision 9312)
          +++ build-src/java/solr/org/apache/lucene/search/CustomBoostQuery.java (working copy)
          @@ -280,7 +280,7 @@

               /*(non-Javadoc) @see org.apache.lucene.search.Scorer#score() */
               public float score() throws IOException {
          -      float boostScore = (boostScorer==null ? 1 : boostScorer.score());
          +      float boostScore = (boostScorer==null || subQueryScorer.doc() != boostScorer.doc() ? 1 : boostScorer.score());
                 return qWeight * customScore(subQueryScorer.doc(), subQueryScorer.score(), boostScore);
               }

          @@ -300,7 +300,8 @@
                   return subQueryExpl;
                 }
                 // match
          -      Explanation boostExpl = boostScorer==null ? null : boostScorer.explain(doc);
          +      Explanation boostExpl = boostScorer==null ? null :
          +        weight.qStrict ? boostScorer.explain(doc) : weight.boostWeight.explain(reader,doc);
                 Explanation customExp = customExplain(doc,subQueryExpl,boostExpl);
                 float sc = qWeight * customExp.getValue();
                 Explanation res = new ComplexExplanation(
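The first hunk of the diff falls back to a boost of 1 whenever the boost scorer is not positioned on the same document as the sub-query scorer. A minimal stand-alone sketch of that rule follows, using sorted doc-id arrays in place of real scorers; `AlignedBoost` and its parameters are hypothetical, and the multiplication stands in for whatever `customScore` would compute.

```java
// Hypothetical simulation of the doc-alignment check; arrays stand in for Lucene scorers.
final class AlignedBoost {
    private AlignedBoost() {}

    /**
     * mainDocs and boostDocs are assumed to be sorted ascending, as scorers return docs.
     * Returns one score per main doc; docs the boost query does not match keep boost = 1.
     */
    static float[] boostedScores(int[] mainDocs, float[] mainScores,
                                 int[] boostDocs, float[] boostScores) {
        float[] out = new float[mainDocs.length];
        int b = 0; // position of the simulated boost scorer
        for (int i = 0; i < mainDocs.length; i++) {
            // Advance the boost scorer to the first doc >= the current main doc.
            while (b < boostDocs.length && boostDocs[b] < mainDocs[i]) {
                b++;
            }
            boolean sameDoc = b < boostDocs.length && boostDocs[b] == mainDocs[i];
            float boost = sameDoc ? boostScores[b] : 1f; // default value when there is no match
            out[i] = mainScores[i] * boost;
        }
        return out;
    }
}
```

Docs that match only the boost query never appear in the output, which matches the stated criterion that the doc set is unchanged relative to the original query.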
          Erick Erickson added a comment -

          SPRING_CLEANING_2013 We can reopen if necessary. Think this code has been extensively re-worked anyway.


            People

            • Assignee:
              Unassigned
              Reporter:
              Mike Klaas
            • Votes:
              1
              Watchers:
              3
