Lucene - Core / LUCENE-9725

Allow BM25FQuery to use other similarities

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      From a high level, BM25FQuery works as follows:

      1. Given a list of fields and weights, it pretends there's a synthetic combined field where all terms have been indexed. It computes new term and collection statistics for this combined field.
      2. It uses a disjunction iterator and BM25Similarity to score the documents.

      The steps are (1) compute statistics that represent the combined field content, and (2) pass these to a similarity function. There is nothing really specific to BM25Similarity in this approach. In step 2, we could use another similarity, for example BooleanSimilarity or those based on language models like LMDirichletSimilarity. The main restriction is that norms have to be additive (the norm of the combined field must be the sum of the field norms).
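
      The two steps can be illustrated with a minimal, self-contained sketch. It does not use Lucene's API: the class CombinedFieldSketch, the FieldStats holder, the SimpleSimilarity interface, and all numbers are hypothetical names and values chosen for illustration. The BM25 formula is a simplified version with assumed parameters k1 = 1.2 and b = 0.75. The point is only that the combined frequency and the additive combined length are computed once and can then be handed to any scoring function.

```java
import java.util.Arrays;
import java.util.List;

public class CombinedFieldSketch {

    /** Hypothetical per-field inputs for one document and one term. */
    public static final class FieldStats {
        final float weight;     // field weight supplied by the user
        final int termFreq;     // occurrences of the term in this field
        final int fieldLength;  // number of tokens in this field
        public FieldStats(float weight, int termFreq, int fieldLength) {
            this.weight = weight;
            this.termFreq = termFreq;
            this.fieldLength = fieldLength;
        }
    }

    /** Hypothetical stand-in for a pluggable similarity. */
    public interface SimpleSimilarity {
        double score(double freq, double docLength, double avgDocLength,
                     long docFreq, long docCount);
    }

    /** Step 1a: term frequency in the synthetic combined field (weighted sum). */
    public static double combinedFreq(List<FieldStats> fields) {
        double freq = 0;
        for (FieldStats f : fields) {
            freq += f.weight * f.termFreq;
        }
        return freq;
    }

    /** Step 1b: additive norm, the weighted sum of the field lengths. */
    public static double combinedLength(List<FieldStats> fields) {
        double length = 0;
        for (FieldStats f : fields) {
            length += f.weight * f.fieldLength;
        }
        return length;
    }

    /** Simplified BM25 (assumed k1 = 1.2, b = 0.75). */
    public static final SimpleSimilarity BM25 = (freq, len, avgLen, df, n) -> {
        double k1 = 1.2, b = 0.75;
        double idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
        return idf * freq / (freq + k1 * (1 - b + b * len / avgLen));
    };

    /** Boolean-style scoring: 1 if the term matches anywhere, else 0. */
    public static final SimpleSimilarity BOOLEAN =
        (freq, len, avgLen, df, n) -> freq > 0 ? 1.0 : 0.0;

    /** Step 2: pass the combined statistics to whichever similarity is configured. */
    public static double score(List<FieldStats> fields, SimpleSimilarity sim,
                               double avgLen, long docFreq, long docCount) {
        return sim.score(combinedFreq(fields), combinedLength(fields),
                         avgLen, docFreq, docCount);
    }

    public static void main(String[] args) {
        // A "title" field with weight 2 and a "body" field with weight 1 (made-up values).
        List<FieldStats> doc = Arrays.asList(
            new FieldStats(2f, 1, 5),
            new FieldStats(1f, 3, 100));
        System.out.println("BM25:    " + score(doc, BM25, 120.0, 10, 1000));
        System.out.println("Boolean: " + score(doc, BOOLEAN, 120.0, 10, 1000));
    }
}
```

      Swapping BM25 for BOOLEAN changes only the last argument of score; the combined statistics are untouched, which is the property the proposal relies on.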

      Maybe we could stop hardcoding BM25Similarity in BM25FQuery and instead use the similarity configured on IndexSearcher. We could think of this as providing a sensible default approach to cross-field scoring for many similarities. It's an incremental step towards LUCENE-8711, which would give similarities more fine-grained control over how stats/scores are combined across fields.

            People

              Assignee: Julie Tibshirani (julietibs)
              Reporter: Julie Tibshirani (julietibs)
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m