
Key: LUCENE-850
Type: New Feature
Status: Open
Priority: Major
Assignee: Unassigned
Reporter: Mike Klaas
Votes: 1
Watchers: 2
Lucene - Java

Easily create queries that transform subquery scores arbitrarily

Created: 26/Mar/07 07:13 PM   Updated: 21/Apr/09 11:18 PM
Component/s: Search
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
CustomBoostQuery.java (Java source file, licensed for inclusion in ASF works) - 2007-08-27 10:54 PM - Mike Klaas - 12 kB
prodscorer.patch.diff - 2007-03-26 07:16 PM - Mike Klaas - 39 kB
Issue Links:
Reference
 

Lucene Fields: Patch Available


Description
Refactor DisMaxQuery into SubQuery(Query|Scorer) that admits easy subclassing. An example is given for multiplicatively combining scores.

Note: patch is not clean; for demonstration purposes only.



Mike Klaas added a comment - 26/Mar/07 07:16 PM
Generify the subquery handling logic of DisMax to make it easy to build subquery scorers.

This patch is demonstrative only. There are no tests, and I'm pretty sure the query norm calculation isn't correct in general.


Doron Cohen added a comment - 03/May/07 03:36 AM
The ability to transform doc scores obtained by a query is now part of LUCENE-446.

I think that, to a certain extent, the patch in this issue goes further than that of LUCENE-446. Here it seems that the scores of any set of queries can be combined, whereas in LUCENE-446 the score transformation is applied to 2 or 3 inputs:

1. score of a single sub-query (any query).
2. docid
3. score of a single, optional, sub-field-score-query.

The last one is optional; it assigns a score equal to the value of an indexed field.

For this reason I hesitated to mark this issue as a duplicate of LUCENE-446.

But I did not want to basically re-implement BooleanQuery for a multi-query score transformation.
And, for the use cases that I can think of, the 3-way approach in LUCENE-446 is sufficiently flexible.

Thoughts?


Doron Cohen added a comment - 05/Jun/07 05:23 PM
Mike,

If I understood it correctly, your patch can be described as:

  • turns DisMaxQuery into a special case of a new generalized "CustomizableOrQuery"
  • demonstrates this customizability with a new ProductQuery.
  • DisMax(OR)Query logic is as before: score = max of sub-scores plus tie breaker.
  • Product(OR)Query logic is: score = multiplication of scores of sub-scorers.

The regular Boolean OR could probably be phrased this way as Sum(OR)Query.
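
To make the three variants above concrete, here is a minimal, self-contained sketch of a pluggable score combinator. The names `ScoreCombiner` and `Combiners` are hypothetical and are not taken from the attached patch; the DisMax formula assumes the "max plus tie breaker" description given above, with the tie breaker applied to the remaining sub-scores.

```java
// Hypothetical names; this is not the API of the attached patch.
interface ScoreCombiner {
    // subScores: scores of the sub-queries that matched the current document
    // (assumed non-empty, since an OR-style query needs at least one match).
    float combine(float[] subScores);
}

final class Combiners {
    private Combiners() {}

    // DisMax-style: best sub-score plus tieBreaker times the remaining ones.
    static ScoreCombiner disMax(float tieBreaker) {
        return subScores -> {
            float max = Float.NEGATIVE_INFINITY;
            float sum = 0f;
            for (float s : subScores) {
                max = Math.max(max, s);
                sum += s;
            }
            return max + tieBreaker * (sum - max);
        };
    }

    // Product-style: multiply all sub-scores.
    static ScoreCombiner product() {
        return subScores -> {
            float p = 1f;
            for (float s : subScores) {
                p *= s;
            }
            return p;
        };
    }

    // Sum-style: add all sub-scores (ignores coord and normalization).
    static ScoreCombiner sum() {
        return subScores -> {
            float total = 0f;
            for (float s : subScores) {
                total += s;
            }
            return total;
        };
    }
}
```

The sketch only covers the combining function; weights, query norms, and doc-id iteration, which the patch also has to deal with, are left out.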

Now in LUCENE-446 I added CustomScoreQuery, which is simpler:

  • score = f (score(q), score(vq))
    where
  • f() is overridable,
  • q is any query
  • vq is optional, and it is a value-source-query, likely based on (cached) field values.
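
The shape described above can be sketched as follows. This is an illustrative stand-in, not the CustomScoreQuery class from LUCENE-446: the class names are hypothetical, multiplication is used only as a placeholder default for f(), and a neutral value of 1 is assumed when vq is absent.

```java
// Hypothetical sketch; not the CustomScoreQuery class from LUCENE-446.
class CustomScoreSketch {
    // f(): override to change how the two scores are combined.
    // Multiplication is only a placeholder default here.
    float customScore(int doc, float subQueryScore, float valSrcScore) {
        return subQueryScore * valSrcScore;
    }

    // valSrcScore is null when no value-source query (vq) was supplied;
    // a neutral 1 is substituted in that case (an assumption of this sketch).
    final float score(int doc, float subQueryScore, Float valSrcScore) {
        float v = (valSrcScore == null) ? 1f : valSrcScore;
        return customScore(doc, subQueryScore, v);
    }
}

// Example override: damp the field value instead of multiplying by it directly.
class LogDampedScore extends CustomScoreSketch {
    @Override
    float customScore(int doc, float subQueryScore, float valSrcScore) {
        return subQueryScore * (float) Math.log1p(valSrcScore);
    }
}
```

Only two inputs besides the doc id reach f(), which is why combining the scores of two arbitrary queries does not fit this shape directly.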

So it currently doesn't support your comment
"I've often wanted to multiply the scores of two queries".

When first writing CustomScoreQuery I looked at combining any two or N subqueries, but wasn't sure how to do this: how to normalize, and how to calculate the weights. But now I think that we could perhaps follow your approach more closely: call it CustomOrQuery, go for any N subqueries, and define f() accordingly.

But is this really required / useful?
What are the use cases for this general/arbitrary combining of scores (beyond current capabilities of o.a.l.search.function)?

Thanks,
Doron


Tim Sturge added a comment - 03/Jul/07 07:31 PM
I just asked for a product scored BooleanQuery on java-users and Mike pointed me in the direction of this bug. My use case is to get the non-phrase query "John Bush" to rank "John Bush" higher than "George Bush" or "John Kerry". I believe this is a common use case (I have 3 or 4 bugs filed against search quality internally that boil down to this issue.)

Mike Klaas added a comment - 03/Jul/07 09:05 PM
Hi Doron,

The main use case is the same as for document (and, to a lesser extent, field) boosts: the ability to weight a document by a certain amount (rather than adding an additive boost, as adding an additional subclause to the query would entail).

The function query capability works for many situations, as you can store the various types of boosts in a FieldCache and use your approach. But this doesn't scale when there are tons of possible boost fields (which would usually be sparsely-populated). SparseFieldCache, anyone?

I decided to move away from ProductQueries for the time being, so that is no longer the main use case of this patch. Primarily the patch stems from developer frustration with implementing something like ProductQuery. ISTM that the subquery-handling logic (present in BooleanQuery and slightly different in DisMaxQuery) needn't be so tightly coupled with a choice of scoring function.

For the record, DisMax is actually a ( x*Max + (1-x)*Sum ) Query, so it is both Sum and Max. Perhaps if we add Prod to the options, there are no more useful subquery combinators?
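
As a short check of that identity, assume the DisMax score is the maximum sub-score plus a tie-breaker multiplier t applied to the remaining sub-scores (the symbols M, S, t, and x are introduced here only for illustration):

```latex
% s_1, ..., s_n: sub-scores; M = max_i s_i; S = sum_i s_i; t: tie-breaker multiplier
\begin{align*}
\text{score} &= M + t\,(S - M) \\
             &= (1 - t)\,M + t\,S \\
             &= x\,M + (1 - x)\,S, \qquad x = 1 - t .
\end{align*}
```

With t = 0 (x = 1) this is a pure Max query, and with t = 1 (x = 0) it is a pure Sum query.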


Mike Klaas added a comment - 03/Jul/07 09:08 PM
Tim: That is typically done by adding an optional implicit phrase query:

john bush -> +(john bush) "john bush"~1000

This works very well for two-term queries, but less well when there are more terms than that. See also DisjunctionMaxQuery if there are multiple fields.
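
A minimal sketch of that rewrite at the query-string level is shown below. The helper `ImplicitPhrase.rewrite` is hypothetical, works on plain strings rather than Lucene Query objects, and assumes the input consists of whitespace-separated words with no query-syntax characters that would need escaping.

```java
// Hypothetical helper; operates on query strings, not on Lucene Query objects.
final class ImplicitPhrase {
    private ImplicitPhrase() {}

    // Turns "john bush" into: +(john bush) "john bush"~slop
    // The required clause keeps the original matches; the optional sloppy
    // phrase only adds score to documents where the terms occur together.
    static String rewrite(String userQuery, int slop) {
        String terms = userQuery.trim().replaceAll("\\s+", " ");
        if (!terms.contains(" ")) {
            // Zero or one term: a phrase clause would add nothing.
            return terms;
        }
        return "+(" + terms + ") \"" + terms + "\"~" + slop;
    }
}
```

For example, `ImplicitPhrase.rewrite("john bush", 1000)` yields the rewritten query shown above.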


Doron Cohen added a comment - 16/Jul/07 11:58 PM
> The function query capability works for many situations, as you
> can store the various types of boosts in a FieldCache and use
> your approach. But this doesn't scale when there are tons of
> possible boost fields (which would usually be sparsely-populated).
> SparseFieldCache, anyone?

For large collections loading would indeed take long.
Quoting Michael, payloads will be more efficient for this case. Two options actually:

  • faster reading values into a cache
  • value-source that feeds on the fly from payloads.

Mike Klaas added a comment - 27/Aug/07 10:54 PM
Here's an approach I think will work.

Rename CustomScoreQuery to CustomBoostQuery, and remove the ValueSource-specific logic. Really there is no reason to limit the logic to ValueSource queries: the only important criterion is that docs matching only the boosting query are not returned (the doc set is unchanged relative to the original query).

I'm not sure what will happen if the boost query doesn't match the document being boosted, however. Perhaps there should be a default value?
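
One possible answer, sketched under assumptions: fall back to a neutral boost of 1 whenever the boost scorer is absent or is not positioned on the same document as the main scorer. The types below are hypothetical stand-ins for Lucene scorers, and multiplication stands in for whatever customScore() computes.

```java
// Hypothetical stand-in for a scorer positioned on some document.
interface PositionedScorer {
    int doc();
    float score();
}

final class BoostSketch {
    static final float DEFAULT_BOOST = 1f; // neutral value under multiplication

    private BoostSketch() {}

    // Convenience factory for a scorer fixed at one doc with one score.
    static PositionedScorer at(int doc, float score) {
        return new PositionedScorer() {
            public int doc() { return doc; }
            public float score() { return score; }
        };
    }

    // Use the boost only if the boost scorer sits on the same doc
    // as the sub-query scorer; otherwise use the default.
    static float boostFor(PositionedScorer sub, PositionedScorer boost) {
        if (boost == null || sub.doc() != boost.doc()) {
            return DEFAULT_BOOST;
        }
        return boost.score();
    }

    static float score(PositionedScorer sub, PositionedScorer boost) {
        return sub.score() * boostFor(sub, boost);
    }
}
```

A default of 1 leaves the original score unchanged for unboosted documents, which only makes sense if the combining function is multiplicative; an additive f() would want 0 instead.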

Does this still belong in the function package?


Mike Klaas added a comment - 31/Aug/07 01:57 AM
To address the issue above, the following needs to be added:
===================================================================
--- build-src/java/solr/org/apache/lucene/search/CustomBoostQuery.java (revision 9312)
+++ build-src/java/solr/org/apache/lucene/search/CustomBoostQuery.java (working copy)
@@ -280,7 +280,7 @@

 /*(non-Javadoc) @see org.apache.lucene.search.Scorer#score() */
 public float score() throws IOException {
-  float boostScore = (boostScorer==null ? 1 : boostScorer.score());
+  float boostScore = (boostScorer==null || subQueryScorer.doc() != boostScorer.doc() ? 1 : boostScorer.score());
   return qWeight * customScore(subQueryScorer.doc(), subQueryScorer.score(), boostScore);
 }

@@ -300,7 +300,8 @@
 return subQueryExpl;
 }
 // match
-Explanation boostExpl = boostScorer==null ? null : boostScorer.explain(doc);
+Explanation boostExpl = boostScorer==null ? null :
+  weight.qStrict ? boostScorer.explain(doc) : weight.boostWeight.explain(reader,doc);
 Explanation customExp = customExplain(doc,subQueryExpl,boostExpl);
 float sc = qWeight * customExp.getValue();
 Explanation res = new ComplexExplanation(