Index: src/java/org/apache/lucene/search/Similarity.java =================================================================== --- src/java/org/apache/lucene/search/Similarity.java (revision 447480) +++ src/java/org/apache/lucene/search/Similarity.java (working copy) @@ -28,56 +28,249 @@ /** Expert: Scoring API. *

Subclasses implement search scoring. + * + *

The score of query q for document d correlates to the + * cosine-distance or dot-product between document and query vectors in a + * + * Vector Space Model (VSM) of Information Retrieval. + * A document whose vector is closer to the query vector in that model is scored higher. + * + *

The score is computed as follows: * - *

The score of query q for document d is defined - * in terms of these methods as follows: - * - * + *
 * - * score(q,d) = Σ t in q ( {@link #tf(int) tf}(t in d) * {@link #idf(Term,Searcher) idf}(t)^2 * {@link Query#getBoost getBoost}(t in q) * {@link org.apache.lucene.document.Field#getBoost getBoost}(t.field in d) * {@link #lengthNorm(String,int) lengthNorm}(t.field in d) ) * {@link #coord(int,int) coord}(q,d) * {@link #queryNorm(float) queryNorm}(sumOfSqaredWeights)
 + * score(q,d) = {@link #coord(int,int) coord}(q,d) · {@link #queryNorm(float) normalizer}(q) · Σ t in q ( {@link #tf(int) tf}(t in d) · {@link #idf(Term,Searcher) idf}(t)^2 · {@link Query#getBoost searchBoost}(t in q) · indexBoost(t,d) )

where + *

    + *
  1. + * coord(q,d) + * is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search-time factor computed in {@link #coord(int,int) coord(q,d)} by the Similarity in effect at search time.
     
    + *
  2. + *
  3. + * normalizer(q) + * + * is a normalizing factor used to make scores between queries comparable. + * This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), + * but rather just attempts to make scores from different queries (or even different indexes) comparable. + * This is a search time factor computed in + * {@link #queryNorm(float) queryNorm(sumOfSquaredWeights)} + * by the Similarity in effect at search time. + * + * The default computation for normalizer(q) in + * {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) DefaultSimilarity} + * is: + *
     
 + * normalizer(q) = {@link org.apache.lucene.search.DefaultSimilarity#queryNorm(float) queryNorm(sumOfSquaredWeights)} = 1 / sumOfSquaredWeights½
     
    + * + * The sum of squared weights (of the query terms) is + * computed by the query {@link org.apache.lucene.search.Weight} object. + * For example, a {@link org.apache.lucene.search.BooleanQuery boolean query} + * computes this value as: + * + *
     
 + * {@link org.apache.lucene.search.Weight#sumOfSquaredWeights() sumOfSquaredWeights} = {@link org.apache.lucene.search.Query#getBoost() searchBoost(q)}^2 · Σ t in q ( {@link #idf(Term,Searcher) idf}(t) · {@link Query#getBoost searchBoost}(t in q) )^2
     
    + * + *
  4. + * + *
  5. + * tf(t in d) + * correlates to the term's frequency - the number of times term t appears in the document d currently being scored. Documents that have more occurrences of a given term receive a higher score. The default computation for tf(t in d) in {@link org.apache.lucene.search.DefaultSimilarity#tf(float) DefaultSimilarity} is:
     
 + * tf(t in d) = frequency½
     
    + *
  6. + * + *
  7. + * idf(t) + * - Inverse Document Frequency - correlates to the inverse of docFreq (the number of documents in + * which the term t appears). This means rarer terms give higher contribution to + * the total score. + * The default computation for idf(t) in + * {@link org.apache.lucene.search.DefaultSimilarity#idf(int, int) DefaultSimilarity} is: + * + *
     
 + * idf(t) = 1 + log ( numDocs / (docFreq+1) )
     
    + *
  8. + * + *
  9. + * {@link org.apache.lucene.search.Query#getBoost() searchBoost(t in q), searchBoost(q)} + * are search-time boosts of a query, set by application calls to {@link org.apache.lucene.search.Query#setBoost(float) setBoost(float)}. Notice that there is really no API for setting the boost of a single term in a multi-term query; rather, multiple terms are represented in a query as multiple {@link org.apache.lucene.search.TermQuery TermQuery} objects, and so the boost of a term in the query is accessible via {@link org.apache.lucene.search.Query#getBoost() subQuery.getBoost()}.
     
    + *
  10. + * + *
  11. + * indexBoost(t,d) + * is a boost for term t in document d that was set at indexing time. + * At search time it would be too late to modify this part of the scoring. + * A few factors come into play here, accounting for fields named the same as the term t: + * + * + * + * When a document is added to the index, all the above factors are multiplied. + * If the document has multiple fields with the same name, all their boosts are multiplied together: + * + *
     
 + * indexBoost(t in d) = {@link org.apache.lucene.document.Document#getBoost() doc.getBoost()} · {@link #lengthNorm(String, int) lengthNorm(field)} · Π field f in d named as t {@link org.apache.lucene.document.Fieldable#getBoost() f.getBoost}()
     
 + * However, the resulting float boost is {@link #encodeNorm(float) encoded} as a single byte and stored in the index as a norm. At search time, the norm byte value is read from disk and {@link #decodeNorm(byte) decoded} back to a float indexBoost. This encoding/decoding, while reducing index size, comes at the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
     
    + *
  12. + * + *
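The default computations spelled out above can be collected into one small, runnable plain-Java sketch. This is not Lucene code - the class and method names are illustrative - but each method restates the corresponding DefaultSimilarity formula described in the items above (tf, idf, lengthNorm, coord, queryNorm):

```java
// A plain-Java restatement of the default scoring factors described above.
// Illustrative only; not Lucene's implementation.
public class DefaultFactors {

  // tf(t in d) = frequency^0.5
  public static float tf(int freq) {
    return (float) Math.sqrt(freq);
  }

  // idf(t) = 1 + log(numDocs / (docFreq + 1))
  public static float idf(int docFreq, int numDocs) {
    return (float) (1.0 + Math.log(numDocs / (double) (docFreq + 1)));
  }

  // lengthNorm(field) = 1 / numTerms^0.5  (shorter fields score higher)
  public static float lengthNorm(int numTerms) {
    return (float) (1.0 / Math.sqrt(numTerms));
  }

  // coord(q,d) = overlap / maxOverlap
  public static float coord(int overlap, int maxOverlap) {
    return overlap / (float) maxOverlap;
  }

  // queryNorm(sumOfSquaredWeights) = 1 / sumOfSquaredWeights^0.5
  public static float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
  }

  public static void main(String[] args) {
    // A hypothetical one-term query against a 10-term field in a 1000-doc
    // index, where the term appears in 9 docs and 3 times in this doc:
    float w = idf(9, 1000);
    float norm = queryNorm(w * w);           // sumOfSquaredWeights = idf^2 for boost 1.0
    float score = coord(1, 1) * norm * tf(3) * w * w * lengthNorm(10);
    System.out.println("score = " + score);
  }
}
```

The numbers in main are made up; the point is only how the factors multiply together in the formula.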
- * - * sumOfSqaredWeights =
- * - * Σ - * - * ( {@link #idf(Term,Searcher) idf}(t) * - * {@link Query#getBoost getBoost}(t in q) )^2 - * - * - * - * - * t in q - * - * - * - * - *

Note that the above formula is motivated by the cosine-distance or dot-product - * between document and query vector, which is implemented by {@link DefaultSimilarity}. - * * @see #setDefault(Similarity) * @see IndexWriter#setSimilarity(Similarity) * @see Searcher#setSimilarity(Similarity) Index: xdocs/scoring.xml =================================================================== --- xdocs/scoring.xml (revision 447480) +++ xdocs/scoring.xml (working copy) @@ -1,307 +1,354 @@ - - - - - Grant Ingersoll - Scoring - Apache Lucene - - - - -

-

Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. - In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to - work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms - scores lower than a different document with only one of the query terms.

-

While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can - help you figure out the what and why of Lucene scoring.

-

Lucene scoring uses a combination of the - Vector Space Model (VSM) of Information - Retrieval and the Boolean model - to determine - how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more - times a query term appears in a document relative to - the number of times the term appears in all the documents in the collection, the more relevant that - document is to the query. It uses the Boolean model to first narrow down the documents that need to - be scored based on the use of boolean logic in the Query specification. Lucene also adds some - capabilities and refinements onto this model to support boolean and fuzzy searching, but it - essentially remains a VSM based system at the heart. - For some valuable references on VSM and IR in general refer to the - Lucene Wiki IR references. -

-

The rest of this document will cover Scoring basics and how to change your - Similarity. Next it will cover ways you can - customize the Lucene internals in Changing your Scoring - -- Expert Level which gives details on implementing your own - Query class and related functionality. Finally, we - will finish up with some reference material in the Appendix. -

-
-
-

Scoring is very much dependent on the way documents are indexed, - so it is important to understand indexing (see - Apache Lucene - Getting Started Guide - and the Lucene - file formats - before continuing on with this section.) It is also assumed that readers know how to use the - Searcher.explain(Query query, int doc) functionality, - which can go a long way in informing why a score is returned. -

- -

In Lucene, the objects we are scoring are - Documents. A Document is a collection - of - Fields. Each Field has semantics about how - it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.) It is important to - note that Lucene scoring works on Fields and then combines the results to return Documents. This is - important because two Documents with the exact same content, but one having the content in two Fields - and the other in one Field will return different scores for the same query due to length normalization - (assumming the - DefaultSimilarity - on the Fields). -

-
- -

- Lucene's scoring formula computes the score of one document d for a given query q across each - term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more - relevant document d is to the query q. This is taken from - Similarity: - -

- - score(q,d) = - - sum t in q( - tf - (t in d) * - idf - (t)^2 * - - getBoost - - (t in q) * - getBoost - (t.field in d) * - - lengthNorm - - (t.field in d) ) * - - coord - - (q,d) * - - queryNorm - (sumOfSquaredWeights) -
-

-

- where - -

- sumOfSquaredWeights = - sumt in q( - - idf - - (t) * - - getBoost - - (t in q) )^2 -
-

-

- This scoring formula is mostly implemented in the - TermScorer class, where it makes calls to the - Similarity class to retrieve values for the following. Note that the descriptions apply to DefaultSimilarity implementation: -

    - -
  1. tf(t in d) - Term Frequency - The number of times the term t appears in the current document d being scored. Documents that have more occurrences of a given term receive a higher score.
  2. - -
  3. idf(t) - Inverse Document Frequency - One divided by the number of documents in which the term t appears. This means rarer terms give higher contribution to the total score.

  4. - -
  5. getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.

  6. - -
  7. lengthNorm(t.field in q) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.

  8. - -
  9. coord(q, d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.

  10. - -
  11. queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable - GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) - that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem - to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?

  12. -
- Note, the above definitions are summaries of the javadocs which can be accessed by clicking the links in the formula and are merely provided - for context and are not authoratitive. -

-
- -

OK, so the tf-idf formula and the - Similarity - is great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are - the use and interactions between the - Query classes, as created by each application in - response to a user's information need. -

-

In this regard, Lucene offers a wide variety of Query implementations, most of which are in the - org.apache.lucene.search package. - These implementations can be combined in a wide variety of ways to provide complex querying - capabilities along with - information about where matches took place in the document collection. The Query - section below - highlights some of the more important Query classes. For information on the other ones, see the - package summary. For details on implementing - your own Query class, see Changing your Scoring -- - Expert Level below. -

-

Once a Query has been created and submitted to the - IndexSearcher, the scoring process - begins. (See the Appendix Algorithm section for more notes on the process.) After some infrastructure setup, - control finally passes to the Weight implementation and its - Scorer instance. In the case of any type of - BooleanQuery, scoring is handled by the - BooleanWeight2 (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class), - unless the static - - BooleanQuery#setUseScorer14(boolean) method is set to true, - in which case the - BooleanWeight - (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used by default. - See CHANGES.txt under release 1.9 RC1 for more information on choosing which Scorer to use. -

-

- Assuming the use of the BooleanWeight2, a - BooleanScorer2 is created by bringing together - all of the - Scorers from the sub-clauses of the BooleanQuery. - When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type - of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores - provided by each scorer while factoring in the coord() score. - -

-
- -

For information on the Query Classes, refer to the - search package javadocs -

-
- -

One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on - how to do this, see the - search package javadocs

-
- -
-
-

At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more - about how to do this, refer to the - search package javadocs -

-
- -
- -

- - Karl Wettin's UML on the Wiki -

-
- -

FILL IN HERE. Volunteers?

-
- -

GSI Note: This section is mostly my notes on stepping through the Scoring process and serves as - fertilizer for the earlier sections.

-

In the typical search application, a - Query - is passed to the - Searcher - , beginning the scoring process. -

-

Once inside the Searcher, a - Hits - object is constructed, which handles the scoring and caching of the search results. - The Hits constructor stores references to three or four important objects: -

    -
  1. The - Weight - object of the Query. The Weight object is an internal representation of the Query that - allows the Query to be reused by the Searcher. -
  2. -
  3. The Searcher that initiated the call.
  4. -
  5. A - Filter - for limiting the result set. Note, the Filter may be null. -
  6. -
  7. A - Sort - object for specifying how to sort the results if the standard score based sort method is not - desired. -
  8. -
-

-

Now that the Hits object has been initialized, it begins the process of identifying documents that - match the query by calling getMoreDocs method. Assuming we are not sorting (since sorting doesn't - effect the raw Lucene score), - we call on the "expert" search method of the Searcher, passing in our - Weight - object, - Filter - and the number of results we want. This method - returns a - TopDocs - object, which is an internal collection of search results. - The Searcher creates a - TopDocCollector - and passes it along with the Weight, Filter to another expert search method (for more on the - HitCollector - mechanism, see - Searcher - .) The TopDocCollector uses a - PriorityQueue - to collect the top results for the search. -

-

If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, - we ask the Weight for - a - Scorer - for the - IndexReader - of the current searcher and we proceed by - calling the score method on the - Scorer - . -

-

At last, we are actually going to score some documents. The score method takes in the HitCollector - (most likely the TopDocCollector) and does its business. - Of course, here is where things get involved. The - Scorer - that is returned by the - Weight - object depends on what type of Query was submitted. In most real world applications with multiple - query terms, - the - Scorer - is going to be a - BooleanScorer2 - (see the section on customizing your scoring for info on changing this.) - -

-

Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the - coord() factor. We then - get a internal Scorer based on the required, optional and prohibited parts of the query. - Using this internal Scorer, the BooleanScorer2 then proceeds - into a while loop based on the Scorer#next() method. The next() method advances to the next document - matching the query. This is an - abstract method in the Scorer class and is thus overriden by all derived - implementations. If you have a simple OR query - your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers - from the sub scorers of the OR'd terms.

-
-
- + + + + + Grant Ingersoll + Scoring - Apache Lucene + + + + +
+

Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. + In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to + work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms + scores lower than a different document with only one of the query terms.

+

While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can + help you figure out the what and why of Lucene scoring.

+

Lucene scoring uses a combination of the + Vector Space Model (VSM) of Information + Retrieval and the Boolean model + to determine + how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more + times a query term appears in a document relative to + the number of times the term appears in all the documents in the collection, the more relevant that + document is to the query. It uses the Boolean model to first narrow down the documents that need to + be scored based on the use of boolean logic in the Query specification. Lucene also adds some + capabilities and refinements onto this model to support boolean and fuzzy searching, but it + essentially remains a VSM based system at the heart. + For some valuable references on VSM and IR in general refer to the + Lucene Wiki IR references. +

+

The rest of this document will cover Scoring basics and how to change your + Similarity. Next it will cover ways you can + customize the Lucene internals in Changing your Scoring + -- Expert Level which gives details on implementing your own + Query class and related functionality. Finally, we + will finish up with some reference material in the Appendix. +

+
+
+

Scoring is very much dependent on the way documents are indexed, + so it is important to understand indexing (see + Apache Lucene - Getting Started Guide + and the Lucene + file formats + before continuing on with this section.) It is also assumed that readers know how to use the + Searcher.explain(Query query, int doc) functionality, + which can go a long way in informing why a score is returned. +

+ +

In Lucene, the objects we are scoring are + Documents. A Document is a collection + of + Fields. Each Field has semantics about how + it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.) It is important to + note that Lucene scoring works on Fields and then combines the results to return Documents. This is + important because two Documents with the exact same content, but one having the content in two Fields + and the other in one Field, will return different scores for the same query due to length normalization + (assuming the + DefaultSimilarity + on the Fields).
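The length-normalization effect just mentioned can be shown numerically. Assuming the default lengthNorm(field) = 1/sqrt(numTermsInField) described later in this document, a hypothetical 10-term text stored in a single field normalizes differently than the same text split over two 5-term fields (the class here is a sketch, not Lucene code):

```java
// Shows why identical content in one field vs. two fields scores
// differently under 1/sqrt(length) normalization. Sketch only.
public class LengthNormDemo {

  static float lengthNorm(int numTerms) {
    return (float) (1.0 / Math.sqrt(numTerms));
  }

  public static void main(String[] args) {
    // 10 terms held in a single field:
    float oneField = lengthNorm(10);
    // the same 10 terms split into two 5-term fields, normalized per field:
    float twoFields = lengthNorm(5);
    // A match inside a 5-term field gets a larger norm than inside a 10-term field.
    System.out.println(oneField + " vs " + twoFields);
  }
}
```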

+
+ +

Lucene allows influencing search results by "boosting" in more than one level: +

    +
  • Document level boosting + - while indexing - by calling + document.setBoost() + before a document is added to the index. +
  • +
  • Document's Field level boosting + - while indexing - by calling + field.setBoost() + before adding a field to the document (and before adding the document to the index). +
  • +
  • Query level boosting + - during search, by setting a boost on a query clause, calling + Query.setBoost(). +
  • +
+

+

Indexing time boosts are preprocessed for storage efficiency and written to + the directory (when writing the document) in a single byte (!) as follows: + For each field of a document, all boosts of that field + (i.e. all boosts under the same field name in that doc) are multiplied. + The result is multiplied by the boost of the document, + and also multiplied by a "field length norm" value + that represents the length of that field in that doc + (so shorter fields are automatically boosted up). + The result is encoded as a single byte + (with some precision loss of course) and stored in the directory. + The similarity object in effect at indexing computes the length-norm of the field.
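The multiplication just described can be sketched in plain Java with hypothetical values (these are not Lucene API calls): the float that gets squeezed into the norm byte is the document boost, times the product of all same-named field boosts, times the field's length norm.

```java
// Combines the index-time factors described above into a single float -
// the value that would then be encoded into one norm byte.
// A sketch with made-up names, not Lucene's implementation.
public class IndexBoost {

  static float lengthNorm(int numTerms) {      // default: 1/sqrt(length)
    return (float) (1.0 / Math.sqrt(numTerms));
  }

  static float norm(float docBoost, float[] fieldBoosts, int fieldLength) {
    float b = docBoost;
    for (float f : fieldBoosts) {              // boosts of same-named fields multiply
      b *= f;
    }
    return b * lengthNorm(fieldLength);
  }

  public static void main(String[] args) {
    // doc boost 2.0, two same-named fields boosted 1.5 and 1.0, 16 terms total:
    System.out.println(norm(2.0f, new float[] {1.5f, 1.0f}, 16)); // 2*1.5*1*0.25 = 0.75
  }
}
```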

+

This composition of the 1-byte norm representation + (that is, the indexing-time multiplication of field boosts, doc boost and field-length-norm) + is nicely described in + Fieldable.setBoost().

+

Encoding and decoding of the resulting float norm in a single byte are done by the + static methods of the class Similarity: + encodeNorm() and + decodeNorm(). + Due to loss of precision, it is not guaranteed that decode(encode(x)) = x; + e.g. decode(encode(0.89)) = 0.75. + At scoring (search) time, this norm is brought into the score of the document + as indexBoost, as shown by the formula in + Similarity.
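The precision loss can be demonstrated with a simplified one-byte float codec in the spirit of the norm encoding: a few mantissa bits and a reduced exponent range. This is an illustrative re-implementation, not the exact table Lucene uses, so the decoded values it produces may differ slightly from the ones quoted in the text.

```java
// A simplified one-byte float encoding in the spirit of norm encoding:
// truncated mantissa plus a limited exponent range. Illustrative only;
// Lucene's actual encoding may round differently.
public class NormCodec {

  public static byte encode(float f) {
    if (f <= 0.0f) return 0;
    int bits = Float.floatToIntBits(f);
    int smallfloat = bits >> 21;              // keep exponent + top 2 mantissa bits
    int zero = (63 - 15) << 3;                // offset for the reduced exponent range
    if (smallfloat <= zero) return 1;         // underflow: smallest positive code
    if (smallfloat >= zero + 0x100) return (byte) 255; // overflow: largest code
    return (byte) (smallfloat - zero);
  }

  public static float decode(byte b) {
    if (b == 0) return 0.0f;
    int bits = ((b & 0xff) << 21) + ((63 - 15) << 24);
    return Float.intBitsToFloat(bits);
  }

  public static void main(String[] args) {
    float x = 0.89f;
    float y = decode(encode(x));              // lossy: y != x in general
    System.out.println(x + " -> " + y);
    // Once quantized, re-encoding is stable:
    System.out.println(decode(encode(y)) == y);
  }
}
```

The design point this illustrates: the byte is an index into a small fixed grid of floats, so any value off the grid snaps to a nearby grid point, but values already on the grid survive a round trip unchanged.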

+
+ +

+ Lucene's scoring formula computes the score of one document d for a given query q across each + term t that occurs in q. The score attempts to measure relevance, so the higher the score, the more + relevant document d is to the query q. This is taken from + Similarity: + +

+ + score(q,d) = + + sum t in q( + tf + (t in d) * + idf + (t)^2 * + + getBoost + + (t in q) * + getBoost + (t.field in d) * + + lengthNorm + + (t.field in d) ) * + + coord + + (q,d) * + + queryNorm + (sumOfSquaredWeights) +
+

+

+ where + +

+ sumOfSquaredWeights = + sumt in q( + + idf + + (t) * + + getBoost + + (t in q) )^2 +
+

+

+ This scoring formula is mostly implemented in the + TermScorer class, where it makes calls to the + Similarity class to retrieve values for the following. Note that the descriptions apply to DefaultSimilarity implementation: +

    + +
  1. tf(t in d) - Term Frequency - The number of times the term t appears in the current document d being scored. Documents that have more occurrences of a given term receive a higher score.
  2. + +
  3. idf(t) - Inverse Document Frequency - correlates to the inverse of the number of documents in which the term t appears. This means rarer terms contribute more to the total score.

  4. + +
  5. getBoost(t in q) - The boost, specified in the query by the user, that should be applied to this term. A boost over 1.0 will increase the importance of this term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the default boost) has no effect.

  6. + +
  7. lengthNorm(t.field in d) - The factor to apply to account for differing lengths in the fields that are being searched. Typically longer fields return a smaller value. This means matches against shorter fields receive a higher score than matches against longer fields.

  8. + +
  9. coord(q, d) - Score factor based on how many terms the specified document has in common with the query. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.

  10. + +
  11. queryNorm(sumOfSquaredWeights) - Factor used to make scores between queries comparable + GSI: might be interesting to have a note on why this formula was chosen. I have always understood (but not 100% sure) + that it is not a good idea to compare scores across queries or indexes, so any use of normalization may lead to false assumptions. However, I also seem + to remember some research on using sum of squares as being somewhat suitable for score comparison. Anyone have any thoughts here?

  12. +
Note, the above definitions are summaries of the javadocs, which can be accessed by clicking the links in the formula; they are merely provided + for context and are not authoritative.

+
+ +

OK, so the tf-idf formula and the + Similarity + is great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are + the use and interactions between the + Query classes, as created by each application in + response to a user's information need. +

+

In this regard, Lucene offers a wide variety of Query implementations, most of which are in the + org.apache.lucene.search package. + These implementations can be combined in a wide variety of ways to provide complex querying + capabilities along with + information about where matches took place in the document collection. The Query + section below + highlights some of the more important Query classes. For information on the other ones, see the + package summary. For details on implementing + your own Query class, see Changing your Scoring -- + Expert Level below. +

+

Once a Query has been created and submitted to the + IndexSearcher, the scoring process + begins. (See the Appendix Algorithm section for more notes on the process.) After some infrastructure setup, + control finally passes to the Weight implementation and its + Scorer instance. In the case of any type of + BooleanQuery, scoring is handled by the + BooleanWeight2 (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class), + unless the static + + BooleanQuery#setUseScorer14(boolean) method is set to true, + in which case the + BooleanWeight + (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class) from the 1.4 version of Lucene is used instead. + See CHANGES.txt under release 1.9 RC1 for more information on choosing which Scorer to use.

+

Assuming the use of the BooleanWeight2, a + BooleanScorer2 is created by bringing together + all of the + Scorers from the sub-clauses of the BooleanQuery. + When the BooleanScorer2 is asked to score, it delegates its work to an internal Scorer based on the type + of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores + provided by each scorer while factoring in the coord() score.

+
+ +

For information on the Query Classes, refer to the + search package javadocs +

+
+ +

One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on + how to do this, see the + search package javadocs

+
+ +
+
+

At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more + about how to do this, refer to the + search package javadocs +

+
+ +
+ +

+ + Karl Wettin's UML on the Wiki +

+
+ +

FILL IN HERE. Volunteers?

+
+ +

GSI Note: This section is mostly my notes on stepping through the Scoring process and serves as + fertilizer for the earlier sections.

+

In the typical search application, a + Query + is passed to the + Searcher + , beginning the scoring process. +

+

Once inside the Searcher, a + Hits + object is constructed, which handles the scoring and caching of the search results. + The Hits constructor stores references to three or four important objects: +

    +
  1. The + Weight + object of the Query. The Weight object is an internal representation of the Query that + allows the Query to be reused by the Searcher. +
  2. +
  3. The Searcher that initiated the call.
  4. +
  5. A + Filter + for limiting the result set. Note, the Filter may be null. +
  6. +
  7. A + Sort + object for specifying how to sort the results if the standard score based sort method is not + desired. +
  8. +
+

+

Now that the Hits object has been initialized, it begins the process of identifying documents that + match the query by calling the getMoreDocs method. Assuming we are not sorting (since sorting doesn't + affect the raw Lucene score), + we call on the "expert" search method of the Searcher, passing in our + Weight + object, + Filter + and the number of results we want. This method + returns a + TopDocs + object, which is an internal collection of search results. + The Searcher creates a + TopDocCollector + and passes it along with the Weight and Filter to another expert search method (for more on the + HitCollector + mechanism, see + Searcher + .) The TopDocCollector uses a + PriorityQueue + to collect the top results for the search.
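The top-results collection step can be sketched with java.util.PriorityQueue: keep the queue bounded at the requested size and evict the smallest-scoring entry whenever a better hit arrives. This is only in the spirit of a TopDocCollector; the ScoreDoc class below is a stand-in, not Lucene's.

```java
import java.util.PriorityQueue;

// Sketch of bounded top-k collection, roughly what a priority-queue-backed
// collector does. Illustrative only; not Lucene code.
public class TopKCollector {

  static class ScoreDoc {                 // stand-in for Lucene's ScoreDoc
    final int doc;
    final float score;
    ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
  }

  final int size;
  // min-heap: the lowest-scoring hit sits at the head, ready for eviction
  final PriorityQueue<ScoreDoc> pq =
      new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

  TopKCollector(int size) { this.size = size; }

  void collect(int doc, float score) {
    pq.offer(new ScoreDoc(doc, score));
    if (pq.size() > size) {
      pq.poll();                          // drop the current worst hit
    }
  }

  public static void main(String[] args) {
    TopKCollector c = new TopKCollector(2);
    c.collect(0, 0.3f);
    c.collect(1, 0.9f);
    c.collect(2, 0.1f);                   // never makes the top 2
    c.collect(3, 0.5f);
    for (ScoreDoc sd : c.pq) {
      System.out.println(sd.doc + " " + sd.score);
    }
  }
}
```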

+

If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, + we ask the Weight for + a + Scorer + for the + IndexReader + of the current searcher and we proceed by + calling the score method on the + Scorer + . +

+

At last, we are actually going to score some documents. The score method takes in the HitCollector + (most likely the TopDocCollector) and does its business. + Of course, here is where things get involved. The + Scorer + that is returned by the + Weight + object depends on what type of Query was submitted. In most real world applications with multiple + query terms, + the + Scorer + is going to be a + BooleanScorer2 + (see the section on customizing your scoring for info on changing this.) + +

+

Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the + coord() factor. We then + get an internal Scorer based on the required, optional and prohibited parts of the query. + Using this internal Scorer, the BooleanScorer2 then proceeds + into a while loop based on the Scorer#next() method. The next() method advances to the next document + matching the query. This is an + abstract method in the Scorer class and is thus overridden by all derived + implementations. If you have a simple OR query, + your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers + from the sub scorers of the OR'd terms.
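The summing behavior described above can be sketched over plain per-term posting maps: each OR'd term contributes a (doc -> score) list, and a matching document's final contribution is the sum over all terms that hit it. This is a toy stand-in for DisjunctionSumScorer; real scorers iterate lazily by docid and fold in coord().

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy disjunction-sum: each term contributes a (doc -> score) posting list;
// an OR over terms sums the per-term scores for every matching document.
// A sketch, not Lucene's DisjunctionSumScorer.
public class DisjunctionSum {

  static Map<Integer, Float> sum(List<Map<Integer, Float>> postings) {
    Map<Integer, Float> out = new TreeMap<>(); // doc order, like next() advancing by docid
    for (Map<Integer, Float> p : postings) {
      for (Map.Entry<Integer, Float> e : p.entrySet()) {
        out.merge(e.getKey(), e.getValue(), Float::sum);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // term A matches docs 1 and 3; term B matches docs 3 and 7
    Map<Integer, Float> a = Map.of(1, 0.4f, 3, 0.2f);
    Map<Integer, Float> b = Map.of(3, 0.5f, 7, 0.1f);
    // doc 3 receives the summed contribution of both terms
    System.out.println(sum(List.of(a, b)));
  }
}
```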

+
+
+
\ No newline at end of file