Index: src/java/org/apache/lucene/search/Similarity.java
===================================================================
--- src/java/org/apache/lucene/search/Similarity.java (revision 447480)
+++ src/java/org/apache/lucene/search/Similarity.java (working copy)
@@ -28,56 +28,260 @@
 /** Expert: Scoring API.
  * Subclasses implement search scoring.
- *
- * The score of query q for document d is defined
- * in terms of these methods as follows:
- *
+ *
+ * The score of query q for document d correlates to the
+ * cosine-distance or dot-product between document and query vectors in a
+ * Vector Space Model (VSM) of Information Retrieval.
+ * A document whose vector is closer to the query vector in that model is scored higher.
+ *
+ * The score is computed as follows:
+ *

- * score(q,d) =
- *   Σ (t in q)
- *   ( {@link #tf(int) tf}(t in d) *
- *     {@link #idf(Term,Searcher) idf}(t)^2 *
- *     {@link Query#getBoost getBoost}(t in q) *
- *     {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) *
- *     {@link #lengthNorm(String,int) lengthNorm}(t.field in d) )
- *   * {@link #coord(int,int) coord}(q,d) *
- *   {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
+ *
+ * $$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)$$
+ *

where + *

    + *
  1. + * tf(t in d) correlates to the term's frequency,
+ * defined as the number of times term t appears in the currently scored document d.
+ * Documents that have more occurrences of a given term receive a higher score.
+ * The default computation for tf(t in d) in
+ * {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is:
  *
- * sumOfSqaredWeights =
- *   Σ (t in q) ( {@link #idf(Term,Searcher) idf}(t) * {@link Query#getBoost getBoost}(t in q) )^2
+ *
+ * $$\mathrm{tf}(t \text{ in } d) = \mathrm{frequency}^{1/2}$$
+ *
  2. * - *

    Note that the above formula is motivated by the cosine-distance or dot-product
- * between document and query vector, which is implemented by {@link DefaultSimilarity}.
- *
+ *

  3. + * idf(t) stands for Inverse Document Frequency. This value
+ * correlates to the inverse of docFreq
+ * (the number of documents in which the term t appears).
+ * This means rarer terms contribute more to the total score.
+ * The default computation for idf(t) in
+ * {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity} is:
+ *
+ * $$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$
+ *
    + *
  4. + * + *
  5. + * coord(q,d)
+ * is a score factor based on how many of the query terms are found in the specified document.
+ * Typically, a document that contains more of the query's terms will receive a higher score
+ * than another document with fewer query terms.
+ * This is a search time factor computed in
+ * {@link #coord(int, int) coord(q,d)}
+ * by the Similarity in effect at search time.
+ *
     
    + *
  6. + * + *
  7. + * queryNorm(q)
+ * is a normalizing factor used to make scores between queries comparable.
+ * This factor does not affect document ranking (since all ranked documents are multiplied by the same factor);
+ * it only attempts to make scores from different queries (or even different indexes) comparable.
+ * This is a search time factor computed by the Similarity in effect at search time.
+ *
+ * The default computation in
+ * {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity}
+ * is:
+ *
+ * $$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$
+ *
     
+ * The sum of squared weights (of the query terms) is
+ * computed by the query {@link org.apache.lucene.search.Weight} object.
+ * For example, a {@link org.apache.lucene.search.BooleanQuery boolean query}
+ * computes this value as:
+ *
+ * $$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$
+ *
+ * Here sumOfSquaredWeights is {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() Weight.sumOfSquaredWeights()}
+ * and q.getBoost() is {@link org.apache.lucene.search.Query#getBoost() Query.getBoost()}.
+ *
    + * + *
  8. + * + *
  9. + * t.getBoost()
+ * is a search time boost of term t in the query q as
+ * specified in the query text
+ * (see query syntax),
+ * or as set by application calls to
+ * {@link org.apache.lucene.search.Query#setBoost(float) setBoost()}.
+ * Note that there is no direct API for accessing the boost of one term in a multi-term query;
+ * instead, the terms are represented in the query as multiple
+ * {@link org.apache.lucene.search.TermQuery TermQuery} objects,
+ * so the boost of a term in the query is accessible by calling
+ * {@link org.apache.lucene.search.Query#getBoost() getBoost()} on the corresponding sub-query.
+ *
     
    + *
  10. + * + *
  11. + * norm(t,d) encapsulates a few (indexing time) boost and length factors:
+ * the document boost, the field boost, and lengthNorm(field), as they appear in the formula below.
+ *
+ * When a document is added to the index, all the above factors are multiplied.
+ * If the document has multiple fields with the same name, all their boosts are multiplied together:
+ *
+ * $$\mathrm{norm}(t,d) = \mathrm{doc.getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()$$
+ *
+ * Here doc.getBoost() is {@link org.apache.lucene.document.Document#getBoost() Document.getBoost()},
+ * lengthNorm(field) is {@link #lengthNorm(String, int) lengthNorm(String, int)}, and
+ * f.getBoost() is {@link org.apache.lucene.document.Fieldable#getBoost() Fieldable.getBoost()}.
+ *
+ * However, the resulting norm value is {@link #encodeNorm(float) encoded} as a single byte
+ * before being stored.
+ * At search time, the norm byte value is read from the index
+ * {@link org.apache.lucene.store.Directory directory} and
+ * {@link #decodeNorm(byte) decoded} back to a float norm value.
+ * This encoding/decoding, while reducing index size, comes at the price of
+ * precision loss: it is not guaranteed that decode(encode(x)) = x.
+ * For instance, decode(encode(0.89)) = 0.75.
+ * Also note that search time is too late to modify this norm part of scoring, e.g. by
+ * using a different {@link Similarity} for search.
+ *
     
    + *

  12. + *
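The factors listed above can be combined in a few lines of plain Java. The following is only a minimal sketch of the documented formula, not Lucene code: the class `ScoreSketch`, the holder `TermStats`, and all method names are hypothetical. It assumes that `log` in the idf formula is the natural logarithm and that `coord(q,d)` is the fraction of query terms found in the document; neither assumption is stated in the text above.

```java
/** Hypothetical, dependency-free sketch of the scoring formula; these are not Lucene's classes. */
final class ScoreSketch {

    /** Statistics for one query term against one document (illustrative fields). */
    static final class TermStats {
        final double freq;   // occurrences of the term in the document (0 if absent)
        final int docFreq;   // number of documents that contain the term
        final double boost;  // t.getBoost()
        final double norm;   // decoded norm(t,d)

        TermStats(double freq, int docFreq, double boost, double norm) {
            this.freq = freq;
            this.docFreq = docFreq;
            this.boost = boost;
            this.norm = norm;
        }
    }

    /** tf(t in d) = frequency^(1/2). */
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    /** idf(t) = 1 + log(numDocs / (docFreq + 1)); natural log is an assumption. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** queryNorm = 1 / sumOfSquaredWeights^(1/2). */
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    /** Assumption: coord is the fraction of query terms found in the document. */
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    /** q.getBoost()^2 times the sum over all query terms of (idf(t) * t.getBoost())^2. */
    static double sumOfSquaredWeights(double queryBoost, TermStats[] terms, int numDocs) {
        double sum = 0.0;
        for (TermStats t : terms) {
            double w = idf(t.docFreq, numDocs) * t.boost;
            sum += w * w;
        }
        return queryBoost * queryBoost * sum;
    }

    /** coord * queryNorm * sum over matching terms of tf * idf^2 * boost * norm. */
    static double score(TermStats[] terms, double queryBoost, int numDocs) {
        int overlap = 0;
        double sum = 0.0;
        for (TermStats t : terms) {
            if (t.freq > 0) {
                overlap++;
                double idf = idf(t.docFreq, numDocs);
                sum += tf(t.freq) * idf * idf * t.boost * t.norm;
            }
        }
        double norm = queryNorm(sumOfSquaredWeights(queryBoost, terms, numDocs));
        return coord(overlap, terms.length) * norm * sum;
    }

    public static void main(String[] args) {
        // Two query terms, only the first of which occurs in the document.
        TermStats[] terms = {
            new TermStats(4, 0, 1.0, 1.0),
            new TermStats(0, 0, 1.0, 1.0)
        };
        System.out.println(score(terms, 1.0, 1));
    }
}
```

The sketch takes the norm as an already decoded float, so the single-byte encoding described for norm(t,d) is not modeled here.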
+ *
  * @see #setDefault(Similarity)
  * @see IndexWriter#setSimilarity(Similarity)
  * @see Searcher#setSimilarity(Similarity)
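As a small worked example of the index-time norm(t,d) product described in the Javadoc above, the sketch below multiplies a document boost, a length norm, and the boosts of all same-named fields. The class `NormSketch` and its methods are hypothetical, and the length norm of `1 / sqrt(numTerms)` is an assumption made only for illustration; the text above does not state the default. The single-byte encoding step is deliberately left out.

```java
/** Hypothetical sketch of the index-time norm product; these are not Lucene's classes. */
final class NormSketch {

    /** Assumption for illustration: shorter fields get a larger length norm. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /**
     * norm(t,d) = doc boost * lengthNorm(field) * product of the boosts of
     * every field instance in the document that shares the term's field name.
     */
    static double norm(double docBoost, int numTerms, double[] fieldBoosts) {
        double product = 1.0;
        for (double boost : fieldBoosts) {
            product *= boost;
        }
        return docBoost * lengthNorm(numTerms) * product;
    }

    public static void main(String[] args) {
        // A document with boost 2.0 and two same-named fields boosted 1.5 and 2.0,
        // holding 4 terms in total.
        System.out.println(norm(2.0, 4, new double[] {1.5, 2.0}));
    }
}
```

Because the value stored in the index is the encoded byte, the float computed this way is only what goes into the encoder, not necessarily what is read back at search time.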