Index: src/java/org/apache/lucene/search/Similarity.java
===================================================================
--- src/java/org/apache/lucene/search/Similarity.java	(revision 447480)
+++ src/java/org/apache/lucene/search/Similarity.java	(working copy)
@@ -28,56 +28,249 @@
 /** Expert: Scoring API.
  * <p>Subclasses implement search scoring.
+ *
+ * <p>The score of query <code>q</code> for document <code>d</code> correlates to the
+ * cosine-distance or dot-product between document and query vectors in a
+ * <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">
+ * Vector Space Model (VSM) of Information Retrieval</a>.
+ * A document whose vector is closer to the query vector in that model is scored higher.
+ *
+ * <p>The score is computed as follows:
  *
- * <p>The score of query <code>q</code> for document <code>d</code> is defined
- * in terms of these methods as follows:
- *
- * <pre>
- *   score(q,d) = SUM over t in q of:
- *       ( {@link #tf(int) tf}(t in d) * {@link #idf(Term,Searcher) idf}(t)^2 *
- *         {@link Query#getBoost getBoost}(t in q) *
- *         {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) *
- *         {@link #lengthNorm(String,int) lengthNorm}(t.field in d) )
- *     * {@link #coord(int,int) coord}(q,d)
- *     * {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
- * </pre>
+ * <pre>
+ *   score(q,d) = {@link #coord(int,int) coord}(q,d) * {@link #queryNorm(float) normalizer}(q) *
+ *                SUM over t in q of:
+ *                  ( {@link #tf(int) tf}(t in d) * {@link #idf(Term,Searcher) idf}(t)^2 *
+ *                    {@link Query#getBoost searchBoost}(t in q) * indexBoost(t,d) )
+ * </pre>
+ *
+ * <p>where
+ *
+ * <pre>
+ *   normalizer(q) = {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm}(sumOfSquaredWeights)
+ *                 = 1 / sumOfSquaredWeights^(1/2)
+ *
+ *   {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} =
+ *       {@link org.apache.lucene.search.Query#getBoost() searchBoost}(q)^2 *
+ *       SUM over t in q of:
+ *         ( {@link #idf(Term,Searcher) idf}(t) * {@link Query#getBoost searchBoost}(t in q) )^2
+ *
+ *   {@link #tf(int) tf}(t in d) = frequency^(1/2)
+ *
+ *   {@link #idf(Term,Searcher) idf}(t) = 1 + log( numDocs / (docFreq + 1) )
+ * </pre>
+ *
+ * <p><b>indexBoost(t,d)</b> is the boost of the field of
+ * <i>t</i> in document <i>d</i> that was set at indexing time.
+ * At search time it would be too late to modify this part of the scoring.
+ * A few factors come into play here, accounting for fields named the same as the term <i>t</i>:
+ *
+ * <pre>
+ *   indexBoost(t in d) = {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()} *
+ *                        {@link #lengthNorm(String, int) lengthNorm}(field) *
+ *                        PRODUCT over fields f in d named as t of:
+ *                          {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
+ * </pre>
- *
- * <p>Note that the above formula is motivated by the cosine-distance or dot-product
- * between document and query vector, which is implemented by {@link DefaultSimilarity}.
  *
* @see #setDefault(Similarity)
* @see IndexWriter#setSimilarity(Similarity)
* @see Searcher#setSimilarity(Similarity)
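The factor definitions in the reworked formula above can be sketched as plain Java, following the DefaultSimilarity definitions (tf = sqrt(frequency), idf = 1 + log(numDocs/(docFreq+1)), lengthNorm = 1/sqrt(numTerms), queryNorm = 1/sqrt(sumOfSquaredWeights), coord = overlap/maxOverlap). This is a minimal illustrative sketch; the class and method signatures here are not Lucene's API.

```java
// Minimal sketch of the scoring factors described in the javadoc above,
// using the DefaultSimilarity definitions. Names are illustrative only.
public class ScoringFactors {

    // tf(t in d) = frequency^(1/2)
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // idf(t) = 1 + log( numDocs / (docFreq + 1) ), natural log
    static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }

    // lengthNorm(field) = 1 / numTerms^(1/2); shorter fields are boosted up
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // queryNorm(sumOfSquaredWeights) = 1 / sumOfSquaredWeights^(1/2)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // coord(q,d) = (number of query terms found in d) / (number of terms in q)
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
```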
Index: xdocs/scoring.xml
===================================================================
--- xdocs/scoring.xml (revision 447480)
+++ xdocs/scoring.xml (working copy)
@@ -1,307 +1,354 @@
-
-
- Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
- In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to
- work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
- scores lower than a different document with only one of the query terms. While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
- help you figure out the what and why of Lucene scoring. Lucene scoring uses a combination of the
- Vector Space Model (VSM) of Information
- Retrieval and the Boolean model
- to determine
- how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more
- times a query term appears in a document relative to
- the number of times the term appears in all the documents in the collection, the more relevant that
- document is to the query. It uses the Boolean model to first narrow down the documents that need to
- be scored based on the use of boolean logic in the Query specification. Lucene also adds some
- capabilities and refinements onto this model to support boolean and fuzzy searching, but it
- essentially remains a VSM based system at the heart.
- For some valuable references on VSM and IR in general refer to the
- Lucene Wiki IR references.
- The rest of this document will cover Scoring basics and how to change your
- Similarity. Next it will cover ways you can
- customize the Lucene internals in Changing your Scoring
- -- Expert Level which gives details on implementing your own
- Query class and related functionality. Finally, we
- will finish up with some reference material in the Appendix.
- Scoring is very much dependent on the way documents are indexed,
- so it is important to understand indexing (see
- Apache Lucene - Getting Started Guide
- and the Lucene
- file formats
- before continuing on with this section.) It is also assumed that readers know how to use the
- Searcher.explain(Query query, int doc) functionality,
- which can go a long way in informing why a score is returned.
- In Lucene, the objects we are scoring are
- Documents. A Document is a collection
- of
- Fields. Each Field has semantics about how
- it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.) It is important to
- note that Lucene scoring works on Fields and then combines the results to return Documents. This is
- important because two Documents with the exact same content, but one having the content in two Fields
- and the other in one Field will return different scores for the same query due to length normalization
- (assumming the
- DefaultSimilarity
- on the Fields).
-
- Lucene's scoring formula computes the score of one document d for a given query q across each
- term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more
- relevant document d is to the query q. This is taken from
- Similarity:
-
-
- where - -
- -- This scoring formula is mostly implemented in the - TermScorer class, where it makes calls to the - Similarity class to retrieve values for the following. Note that the descriptions apply to DefaultSimilarity implementation: -
idf(t) - Inverse Document Frequency - One divided by the number of documents in which the term t appears. This means rarer terms give higher contribution to the total score.
getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.
lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.
coord(q, d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.
queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable - GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) - that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem - to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?
OK, so the tf-idf formula and the - Similarity - is great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are - the use and interactions between the - Query classes, as created by each application in - response to a user's information need. -
-In this regard, Lucene offers a wide variety of Query implementations, most of which are in the - org.apache.lucene.search package. - These implementations can be combined in a wide variety of ways to provide complex querying - capabilities along with - information about where matches took place in the document collection. The Query - section below - highlights some of the more important Query classes. For information on the other ones, see the - package summary. For details on implementing - your own Query class, see Changing your Scoring -- - Expert Level below. -
-Once a Query has been created and submitted to the - IndexSearcher, the scoring process - begins. (See the Appendix Algorithm section for more notes on the process.) After some infrastructure setup, - control finally passes to the Weight implementation and its - Scorer instance. In the case of any type of - BooleanQuery, scoring is handled by the - BooleanWeight2 (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class), - unless the static - - BooleanQuery#setUseScorer14(boolean) method is set to true, - in which case the - BooleanWeight - (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used by default. - See CHANGES.txt under release 1.9 RC1 for more information on choosing which Scorer to use. -
-- Assuming the use of the BooleanWeight2, a - BooleanScorer2 is created by bringing together - all of the - Scorers from the sub-clauses of the BooleanQuery. - When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type - of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores - provided by each scorer while factoring in the coord() score. - -
-For information on the Query Classes, refer to the - search package javadocs -
-One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on - how to do this, see the - search package javadocs
-At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more - about how to do this, refer to the - search package javadocs -
-FILL IN HERE. Volunteers?
-GSI Note: This section is mostly my notes on stepping through the Scoring process and serves as - fertilizer for the earlier sections.
-In the typical search application, a - Query - is passed to the - Searcher - , beginning the scoring process. -
-Once inside the Searcher, a - Hits - object is constructed, which handles the scoring and caching of the search results. - The Hits constructor stores references to three or four important objects: -
Now that the Hits object has been initialized, it begins the process of identifying documents that - match the query by calling getMoreDocs method. Assuming we are not sorting (since sorting doesn't - effect the raw Lucene score), - we call on the "expert" search method of the Searcher, passing in our - Weight - object, - Filter - and the number of results we want. This method - returns a - TopDocs - object, which is an internal collection of search results. - The Searcher creates a - TopDocCollector - and passes it along with the Weight, Filter to another expert search method (for more on the - HitCollector - mechanism, see - Searcher - .) The TopDocCollector uses a - PriorityQueue - to collect the top results for the search. -
-If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, - we ask the Weight for - a - Scorer - for the - IndexReader - of the current searcher and we proceed by - calling the score method on the - Scorer - . -
-At last, we are actually going to score some documents. The score method takes in the HitCollector - (most likely the TopDocCollector) and does its business. - Of course, here is where things get involved. The - Scorer - that is returned by the - Weight - object depends on what type of Query was submitted. In most real world applications with multiple - query terms, - the - Scorer - is going to be a - BooleanScorer2 - (see the section on customizing your scoring for info on changing this.) - -
-Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the - coord() factor. We then - get a internal Scorer based on the required, optional and prohibited parts of the query. - Using this internal Scorer, the BooleanScorer2 then proceeds - into a while loop based on the Scorer#next() method. The next() method advances to the next document - matching the query. This is an - abstract method in the Scorer class and is thus overriden by all derived - implementations. If you have a simple OR query - your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers - from the sub scorers of the OR'd terms.
-Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. + In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to + work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms + scores lower than a different document with only one of the query terms.
+While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can + help you figure out the what and why of Lucene scoring.
+Lucene scoring uses a combination of the + Vector Space Model (VSM) of Information + Retrieval and the Boolean model + to determine + how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more + times a query term appears in a document relative to + the number of times the term appears in all the documents in the collection, the more relevant that + document is to the query. It uses the Boolean model to first narrow down the documents that need to + be scored based on the use of boolean logic in the Query specification. Lucene also adds some + capabilities and refinements onto this model to support boolean and fuzzy searching, but it + essentially remains a VSM based system at the heart. + For some valuable references on VSM and IR in general refer to the + Lucene Wiki IR references. +
+The rest of this document will cover Scoring basics and how to change your + Similarity. Next it will cover ways you can + customize the Lucene internals in Changing your Scoring + -- Expert Level which gives details on implementing your own + Query class and related functionality. Finally, we + will finish up with some reference material in the Appendix. +
+Scoring is very much dependent on the way documents are indexed, + so it is important to understand indexing (see + Apache Lucene - Getting Started Guide + and the Lucene + file formats + before continuing on with this section.) It is also assumed that readers know how to use the + Searcher.explain(Query query, int doc) functionality, + which can go a long way in informing why a score is returned. +
+In Lucene, the objects we are scoring are
+Documents. A Document is a collection
+of
+Fields. Each Field has semantics about how
+it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.) It is important to
+note that Lucene scoring works on Fields and then combines the results to return Documents. This is
+important because two Documents with the exact same content, but one having the content in two Fields
+and the other in one Field, will return different scores for the same query due to length normalization
+(assuming the
+DefaultSimilarity
+on the Fields).
+
+Lucene allows influencing search results by "boosting" at more than one level:
+
+Indexing time boosts are preprocessed for storage efficiency and written to
+the directory (when writing the document) in a single byte (!) as follows:
+For each field of a document, all boosts of that field
+(i.e. all boosts under the same field name in that doc) are multiplied.
+The result is multiplied by the boost of the document,
+and also multiplied by a "field length norm" value
+that represents the length of that field in that doc
+(so shorter fields are automatically boosted up).
+The result is encoded as a single byte
+(with some precision loss, of course) and stored in the directory.
+The similarity object in effect at indexing time computes the length-norm of the field.
+
+This composition of the single-byte representation of norms
+(that is, the indexing-time multiplication of field boosts, doc boost, and field-length norm)
+is nicely described in
+Fieldable.setBoost().
+
+Encoding and decoding of the resulting float norm in a single byte are done by the
+static methods of the class Similarity:
+encodeNorm() and
+decodeNorm().
+Due to loss of precision, it is not guaranteed that decode(encode(x)) = x;
+for example, decode(encode(0.89)) = 0.75.
+At scoring (search) time, this norm is brought into the score of the document
+as indexBoost, as shown by the formula in
+Similarity.
+
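The indexing-time composition described above amounts to simple arithmetic: multiply the boosts of all same-named fields, then the document boost, then the field length norm. The sketch below shows that arithmetic with a hypothetical helper (not Lucene code); the subsequent squeeze of the result into one byte is omitted.

```java
// Sketch of the indexing-time norm composition described above.
// Hypothetical helper, not Lucene code: fieldBoosts holds the boosts of
// all fields in the document that share one field name.
public class IndexBoost {

    static float indexBoost(float docBoost, float[] fieldBoosts, int fieldLengthInTerms) {
        float product = docBoost;
        for (float b : fieldBoosts) {
            product *= b;                 // multiply all same-named field boosts
        }
        // field length norm (DefaultSimilarity): 1 / sqrt(number of terms),
        // so shorter fields are automatically boosted up
        float lengthNorm = (float) (1.0 / Math.sqrt(fieldLengthInTerms));
        // this float is what then gets encoded into a single byte on disk
        return product * lengthNorm;
    }
}
```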
+
+Lucene's scoring formula computes the score of one document d for a given query q across each
+term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more
+relevant document d is to the query q. The formula and the definitions of its factors are given in
+Similarity.
+
+This scoring formula is mostly implemented in the
+TermScorer class, where it makes calls to the
+Similarity class to retrieve values for the following factors. Note that the descriptions apply to the DefaultSimilarity implementation:
+
+tf(t in d) - Term Frequency - The number of times the term t appears in the currently scored document d, dampened by a square root. Documents that contain a term more often receive a higher score.
+idf(t) - Inverse Document Frequency - A factor that decreases as the term t appears in more of the documents in the collection. This means rarer terms give a higher contribution to the total score.
getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.
lengthNorm(t.field in d) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.
coord(q, d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.
queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable + GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) + that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem + to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?
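Putting the factors listed above together, a toy end-to-end combination can be sketched in plain Java. This is illustrative only, assuming the DefaultSimilarity definitions; the relational checks show why a document matching more query terms can outscore one with a higher raw term frequency (the coord() factor).

```java
// Toy combination of the factors listed above, per the formula in Similarity.
// Illustrative only; all names are hypothetical, not Lucene's API.
public class ToyScore {

    // freqs[i] = frequency of query term i in the document (0 = term absent);
    // idfs[i] and boosts[i] are that term's idf and query-time boost
    static double score(int[] freqs, double[] idfs, double[] boosts, double queryNorm) {
        int overlap = 0;
        double sum = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                overlap++;
                // tf(t in d) * idf(t)^2 * searchBoost(t in q)
                sum += Math.sqrt(freqs[i]) * idfs[i] * idfs[i] * boosts[i];
            }
        }
        double coord = overlap / (double) freqs.length;  // coord(q,d)
        return coord * queryNorm * sum;
    }
}
```

With equal idfs and boosts, a document containing both of two query terms once each outscores a document containing only one of them four times, because coord() halves the latter's score.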
+OK, so the tf-idf formula and the
+Similarity
+are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring is
+the use of, and interaction between, the
+Query classes, as created by each application in
+response to a user's information need.
+
+In this regard, Lucene offers a wide variety of Query implementations, most of which are in the + org.apache.lucene.search package. + These implementations can be combined in a wide variety of ways to provide complex querying + capabilities along with + information about where matches took place in the document collection. The Query + section below + highlights some of the more important Query classes. For information on the other ones, see the + package summary. For details on implementing + your own Query class, see Changing your Scoring -- + Expert Level below. +
+Once a Query has been created and submitted to the
+IndexSearcher, the scoring process
+begins. (See the Appendix Algorithm section for more notes on the process.) After some infrastructure setup,
+control finally passes to the Weight implementation and its
+Scorer instance. In the case of any type of
+BooleanQuery, scoring is handled by the
+BooleanWeight2 (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight2 inner class),
+unless the static
+BooleanQuery#setUseScorer14(boolean) method has been called with true,
+in which case the
+BooleanWeight
+(link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used instead.
+See CHANGES.txt under release 1.9 RC1 for more information on choosing which Scorer to use.
+
++ Assuming the use of the BooleanWeight2, a + BooleanScorer2 is created by bringing together + all of the + Scorers from the sub-clauses of the BooleanQuery. + When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type + of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores + provided by each scorer while factoring in the coord() score. + +
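The "sum the sub-scorer scores while factoring in coord()" behavior described above can be sketched for a pure disjunction. This is illustrative only: real scorers stream documents one at a time rather than materializing maps of all scores.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of what a BooleanScorer2-style disjunction does, per the text above:
// sum the score each sub-clause contributes to a document, then scale by
// coord() = (matching clauses) / (total clauses). Illustrative only.
public class BooleanSum {

    // each clause maps docId -> score contributed by that sub-scorer
    static Map<Integer, Float> score(List<Map<Integer, Float>> clauses) {
        Map<Integer, Float> sums = new HashMap<>();
        Map<Integer, Integer> overlap = new HashMap<>();
        for (Map<Integer, Float> clause : clauses) {
            for (Map.Entry<Integer, Float> e : clause.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Float::sum);
                overlap.merge(e.getKey(), 1, Integer::sum);
            }
        }
        // factor in coord(q,d) for every matched document
        for (Map.Entry<Integer, Float> e : sums.entrySet()) {
            e.setValue(e.getValue() * overlap.get(e.getKey()) / (float) clauses.size());
        }
        return sums;
    }
}
```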
+For information on the Query Classes, refer to the + search package javadocs +
+One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on + how to do this, see the + search package javadocs
+At a much deeper level, one can affect scoring by implementing one's own Query classes (and related scoring classes). To learn more
+about how to do this, refer to the
+search package javadocs
+
+FILL IN HERE. Volunteers?
+GSI Note: This section is mostly my notes on stepping through the Scoring process and serves as + fertilizer for the earlier sections.
+In the typical search application, a + Query + is passed to the + Searcher + , beginning the scoring process. +
+Once inside the Searcher, a + Hits + object is constructed, which handles the scoring and caching of the search results. + The Hits constructor stores references to three or four important objects: +
+Now that the Hits object has been initialized, it begins the process of identifying documents that
+match the query by calling the getMoreDocs method. Assuming we are not sorting (since sorting doesn't
+affect the raw Lucene score),
+we call on the "expert" search method of the Searcher, passing in our
+Weight
+object,
+Filter
+and the number of results we want. This method
+returns a
+TopDocs
+object, which is an internal collection of search results.
+The Searcher creates a
+TopDocCollector
+and passes it along with the Weight and Filter to another expert search method (for more on the
+HitCollector
+mechanism, see
+Searcher.)
+The TopDocCollector uses a
+PriorityQueue
+to collect the top results for the search.
+
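The bounded priority-queue collection described above can be sketched with java.util.PriorityQueue: keep a min-heap of the best k hits seen so far, evicting the weakest whenever the heap overflows. A hypothetical standalone helper, not Lucene's TopDocCollector.

```java
import java.util.PriorityQueue;

// Sketch of how a collector keeps the best k hits with a bounded
// PriorityQueue, as described above. Hypothetical helper, not Lucene's class.
public class TopK {

    // returns the doc ids of the k highest scores, best first
    static int[] topK(float[] scoreByDoc, int k) {
        // min-heap ordered by score: the root is the weakest retained hit
        PriorityQueue<float[]> heap =
            new PriorityQueue<>((x, y) -> Float.compare(x[1], y[1]));
        for (int doc = 0; doc < scoreByDoc.length; doc++) {
            heap.offer(new float[]{doc, scoreByDoc[doc]});
            if (heap.size() > k) {
                heap.poll();                  // drop the weakest hit
            }
        }
        int[] docs = new int[heap.size()];
        for (int i = docs.length - 1; i >= 0; i--) {
            docs[i] = (int) heap.poll()[0];   // pop ascending, fill backwards
        }
        return docs;
    }
}
```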
+If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, + we ask the Weight for + a + Scorer + for the + IndexReader + of the current searcher and we proceed by + calling the score method on the + Scorer + . +
+At last, we are actually going to score some documents. The score method takes in the HitCollector + (most likely the TopDocCollector) and does its business. + Of course, here is where things get involved. The + Scorer + that is returned by the + Weight + object depends on what type of Query was submitted. In most real world applications with multiple + query terms, + the + Scorer + is going to be a + BooleanScorer2 + (see the section on customizing your scoring for info on changing this.) + +
+Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the
+coord() factor. We then
+get an internal Scorer based on the required, optional and prohibited parts of the query.
+Using this internal Scorer, the BooleanScorer2 then proceeds
+into a while loop based on the Scorer#next() method. The next() method advances to the next document
+matching the query. This is an
+abstract method in the Scorer class and is thus overridden by all derived
+implementations. If you have a simple OR query,
+your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers
+from the sub scorers of the OR'd terms.
+
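The next()/score() iteration contract described above can be sketched with a toy term scorer walking a postings list, driven by the same while loop the text mentions. Illustrative only; not Lucene's TermScorer or Scorer classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the Scorer#next()/score() contract described above: a toy term
// scorer walks a postings list and scores each matching document.
// Illustrative only; not Lucene's API.
public class ToyTermScorer {

    private final int[] docs;    // postings: matching doc ids, ascending
    private final int[] freqs;   // term frequency in each matching doc
    private final float weight;  // precomputed query weight for this term
    private int pos = -1;

    ToyTermScorer(int[] docs, int[] freqs, float weight) {
        this.docs = docs;
        this.freqs = freqs;
        this.weight = weight;
    }

    boolean next() { return ++pos < docs.length; }   // advance to next match
    int doc()      { return docs[pos]; }
    float score()  { return (float) Math.sqrt(freqs[pos]) * weight; }  // tf * weight

    // the collection loop driven by a HitCollector, as described above
    static Map<Integer, Float> collect(ToyTermScorer scorer) {
        Map<Integer, Float> hits = new LinkedHashMap<>();
        while (scorer.next()) {
            hits.put(scorer.doc(), scorer.score());
        }
        return hits;
    }
}
```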