Index: src/java/org/apache/lucene/search/Similarity.java
===================================================================
--- src/java/org/apache/lucene/search/Similarity.java (revision 447480)
+++ src/java/org/apache/lucene/search/Similarity.java (working copy)
@@ -28,56 +28,260 @@
/** Expert: Scoring API.
*
* Subclasses implement search scoring.
- *
- *
- * The score of query q for document d is defined
- * in terms of these methods as follows:
- *
- *
+ *
+ * The score of query q for document d correlates to the
+ * cosine-distance or dot-product between document and query vectors in a
+ *
+ * Vector Space Model (VSM) of Information Retrieval.
+ * A document whose vector is closer to the query vector in that model is scored higher.
+ *
+ * The score is computed as follows:
+ *
*
- * score(q,d) = Σ (over t in q)
- * ( {@link #tf(int) tf}(t in d) *
- * {@link #idf(Term,Searcher) idf}(t)^2 *
- * {@link Query#getBoost getBoost}(t in q) *
- * {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) *
- * {@link #lengthNorm(String,int) lengthNorm}(t.field in d) )
- * *
- * {@link #coord(int,int) coord}(q,d) *
- * {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
+ * score(q,d) =
+ * coord(q,d) ·
+ * queryNorm(q) ·
+ * ∑ (over t in q)
+ * (
+ * tf(t in d) ·
+ * idf(t)^2 ·
+ * t.getBoost() ·
+ * norm(t,d)
+ * )
*
* where
+ *
+ * -
+ *
+ * tf(t in d)
+ * correlates to the term's frequency,
+ * defined as the number of times term t appears in the currently scored document d.
+ * Documents that have more occurrences of a given term receive a higher score.
+ * The default computation for tf(t in d) in
+ * {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is:
*
- *
- * sumOfSqaredWeights = Σ (over t in q)
- * ( {@link #idf(Term,Searcher) idf}(t) *
- * {@link Query#getBoost getBoost}(t in q) )^2
- *
+ *
+ * {@link org.apache.lucene.search.DefaultSimilarity#tf(float) tf(t in d)} =
+ * frequency^½
+ *
*
- * Note that the above formula is motivated by the cosine-distance or dot-product
- * between document and query vector, which is implemented by {@link DefaultSimilarity}.
- *
+ *
+ * -
+ *
+ * idf(t) stands for Inverse Document Frequency. This value
+ * correlates to the inverse of docFreq
+ * (the number of documents in which the term t appears).
+ * This means rarer terms contribute more to the total score.
+ * The default computation for idf(t) in
+ * {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity} is:
+ *
+ * {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) idf(t)} =
+ * 1 + log ( numDocs / (docFreq+1) )
+ *
+ *
+ * -
+ *
+ * coord(q,d)
+ * is a score factor based on how many of the query terms are found in the specified document.
+ * Typically, a document that contains more of the query's terms will receive a higher score
+ * than another document with fewer query terms.
+ * This is a search time factor computed in
+ * {@link #coord(int, int) coord(q,d)}
+ * by the Similarity in effect at search time.
+ *
+ *
+ *
+ * -
+ *
+ * queryNorm(q)
+ *
+ * is a normalizing factor used to make scores between queries comparable.
+ * This factor does not affect document ranking (since all ranked documents are multiplied by the same factor),
+ * but rather just attempts to make scores from different queries (or even different indexes) comparable.
+ * This is a search time factor computed by the Similarity in effect at search time.
+ *
+ * The default computation in
+ * {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity}
+ * is:
+ *
+ * queryNorm(q) =
+ * {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)}
+ * =
+ * 1 / sumOfSquaredWeights^½
+ *
+ * The sum of squared weights (of the query terms) is
+ * computed by the query {@link org.apache.lucene.search.Weight} object.
+ * For example, a {@link org.apache.lucene.search.BooleanQuery boolean query}
+ * computes this value as:
+ *
+ * {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} =
+ * {@link org.apache.lucene.search.Query#getBoost() q.getBoost()}^2
+ * ·
+ * ∑ (over t in q)
+ * (
+ * idf(t) ·
+ * t.getBoost()
+ * )^2
+ *
+ *
+ * -
+ *
+ * t.getBoost()
+ * is a search time boost of term t in the query q as
+ * specified in the query text
+ * (see query syntax),
+ * or as set by application calls to
+ * {@link org.apache.lucene.search.Query#setBoost(float) setBoost()}.
+ * Note that there is no direct API for accessing the boost of a single term in a multi-term query.
+ * Instead, multiple terms are represented in a query as multiple
+ * {@link org.apache.lucene.search.TermQuery TermQuery} objects,
+ * so the boost of a term in the query is accessible by calling the sub-query's
+ * {@link org.apache.lucene.search.Query#getBoost() getBoost()}.
+ *
+ *
+ *
+ * -
+ *
+ * norm(t,d) encapsulates a few (indexing time) boost and length factors:
+ *
+ *
+ * - Document boost - set by calling
+ * {@link org.apache.lucene.document.Document#setBoost(float) doc.setBoost()}
+ * before adding the document to the index.
+ *
+ * - Field boost - set by calling
+ * {@link org.apache.lucene.document.Fieldable#setBoost(float) field.setBoost()}
+ * before adding the field to a document.
+ *
+ * - {@link #lengthNorm(String, int) lengthNorm(field)} - computed
+ * when the document is added to the index in accordance with the number of tokens
+ * of this field in the document, so that shorter fields contribute more to the score.
+ * LengthNorm is computed by the Similarity class in effect at indexing.
+ *
+ *
+ *
+ *
+ * When a document is added to the index, all the above factors are multiplied.
+ * If the document has multiple fields with the same name, all their boosts are multiplied together:
+ *
+ * norm(t,d) =
+ * {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()}
+ * ·
+ * {@link #lengthNorm(String, int) lengthNorm(field)}
+ * ·
+ * ∏ (over field f in d named as t)
+ * {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
+ *
+ *
+ * However, the resulting norm value is {@link #encodeNorm(float) encoded} as a single byte
+ * before being stored.
+ * At search time, the norm byte value is read from the index
+ * {@link org.apache.lucene.store.Directory directory} and
+ * {@link #decodeNorm(byte) decoded} back to a float norm value.
+ * This encoding/decoding, while reducing index size, comes at the price of
+ * precision loss - it is not guaranteed that decode(encode(x)) = x.
+ * For instance, decode(encode(0.89)) = 0.75.
+ * Also notice that search time is too late to modify this norm part of scoring, e.g. by
+ * using a different {@link Similarity} for search.
+ *
+ *
+ *
+ *
* @see #setDefault(Similarity)
* @see IndexWriter#setSimilarity(Similarity)
* @see Searcher#setSimilarity(Similarity)