Details
- Improvement
- Status: Closed
- Major
- Resolution: Fixed
Description
At a high level, BM25FQuery works as follows:
- Given a list of fields and weights, it pretends there's a synthetic combined field where all terms have been indexed. It computes new term and collection statistics for this combined field.
- It uses a disjunction iterator and BM25Similarity to score the documents.
The steps are (1) compute statistics that represent the combined field content, and (2) pass these to a similarity function. There is nothing really specific to BM25Similarity in this approach. In step 2, we could use another similarity, for example BooleanSimilarity or those based on language models like LMDirichletSimilarity. The main restriction is that norms have to be additive (the norm of the combined field must be the sum of the field norms).
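The two steps can be sketched in plain Java as below. This is only an illustration under stated assumptions, not the BM25FQuery code: `FieldStats`, `CombinedStats`, `SimpleSimilarity` and the class itself are hypothetical stand-ins rather than Lucene types, the combined statistics are assumed to be weight-scaled sums, and the BM25-style formula is simplified (constants k1 = 1.2 and b = 0.75 chosen for the example).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these types are Lucene classes. */
final class CombinedFieldSketch {

    /** Per-field inputs for one term in one document (illustrative only). */
    static final class FieldStats {
        final double weight;
        final long termFreq;
        final long fieldLength;

        FieldStats(double weight, long termFreq, long fieldLength) {
            this.weight = weight;
            this.termFreq = termFreq;
            this.fieldLength = fieldLength;
        }
    }

    /** Statistics of the synthetic combined field for the same document. */
    static final class CombinedStats {
        final double termFreq;
        final double length;

        CombinedStats(double termFreq, double length) {
            this.termFreq = termFreq;
            this.length = length;
        }
    }

    /** Step 2 abstraction: any scoring function over the combined statistics. */
    interface SimpleSimilarity {
        double score(double freq, double docLength, double avgDocLength, double idf);
    }

    /** Step 1: sum per-field values (assumed here to be scaled by the field weight). */
    static CombinedStats combine(List<FieldStats> fields) {
        double freq = 0;
        double length = 0;
        for (FieldStats f : fields) {
            freq += f.weight * f.termFreq;
            // Additivity: the combined length is the sum of the field lengths.
            length += f.weight * f.fieldLength;
        }
        return new CombinedStats(freq, length);
    }

    /** Simplified BM25-style saturation; the constants are assumptions of this sketch. */
    static final SimpleSimilarity BM25_LIKE = (freq, len, avgLen, idf) -> {
        double k1 = 1.2;
        double b = 0.75;
        return idf * freq / (freq + k1 * (1 - b + b * len / avgLen));
    };

    /** Boolean-style scoring: a match scores 1 regardless of frequency or length. */
    static final SimpleSimilarity BOOLEAN_LIKE = (freq, len, avgLen, idf) -> freq > 0 ? 1.0 : 0.0;

    /** Step 2: hand the combined statistics to whichever similarity was chosen. */
    static double score(List<FieldStats> fields, SimpleSimilarity sim, double avgLen, double idf) {
        CombinedStats c = combine(fields);
        return sim.score(c.termFreq, c.length, avgLen, idf);
    }

    public static void main(String[] args) {
        List<FieldStats> fields = Arrays.asList(
            new FieldStats(2.0, 1, 4),    // e.g. a boosted "title" field
            new FieldStats(1.0, 3, 20));  // e.g. a "body" field
        System.out.println(score(fields, BM25_LIKE, 28.0, 1.0));
        System.out.println(score(fields, BOOLEAN_LIKE, 28.0, 1.0));
    }
}
```

Only `combine` depends on the fields and weights. The scoring function sees a single field's worth of statistics, which is why it can be swapped as long as the length it is given remains a meaningful sum.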
Maybe we could stop hardcoding BM25Similarity in BM25FQuery and instead use the similarity configured on IndexSearcher. We could think of this as providing a sensible default approach to cross-field scoring for many similarities. It's an incremental step towards LUCENE-8711, which would give similarities more fine-grained control over how stats/scores are combined across fields.
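To make the proposal concrete, here is a minimal sketch of a query that asks the searcher for its scoring function instead of constructing one itself. `Searcher` and `CombinedQuery` are hypothetical stand-ins, not Lucene's IndexSearcher or BM25FQuery, and the default formula is a simplified BM25-style placeholder chosen for the example.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical sketch; these are not Lucene's IndexSearcher or Similarity. */
final class ConfiguredSimilaritySketch {

    /** Minimal searcher that carries a configurable scoring function. */
    static final class Searcher {
        // Illustrative default: a BM25-style saturation over (freq, normalized length).
        private DoubleBinaryOperator similarity = (freq, norm) -> freq / (freq + 1.2 * norm);

        void setSimilarity(DoubleBinaryOperator similarity) {
            this.similarity = similarity;
        }

        DoubleBinaryOperator getSimilarity() {
            return similarity;
        }
    }

    /** Cross-field query that does not hardcode a scoring function. */
    static final class CombinedQuery {
        double score(Searcher searcher, double combinedFreq, double combinedNorm) {
            // Use whatever the searcher is configured with.
            return searcher.getSimilarity().applyAsDouble(combinedFreq, combinedNorm);
        }
    }

    public static void main(String[] args) {
        Searcher searcher = new Searcher();
        CombinedQuery query = new CombinedQuery();
        System.out.println(query.score(searcher, 5.0, 1.0));

        // Switching the searcher's similarity changes scoring without touching the query.
        searcher.setSimilarity((freq, norm) -> freq > 0 ? 1.0 : 0.0);
        System.out.println(query.score(searcher, 5.0, 1.0));
    }
}
```

With this shape the query needs no knowledge of which similarity is in use, so the caller decides whether the combined-field statistics are scored with a BM25-style, boolean-style, or other function.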