LUCENE-8563

Remove k1+1 from the numerator of BM25Similarity

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Our current implementation of BM25 computes

      boost * IDF * (k1+1) * tf / (tf + norm)
      

      As (k1+1) is a constant, it is the same for every term and doesn't modify ordering. It is often omitted, and I found that the paper "The Probabilistic Relevance Framework: BM25 and Beyond" by Robertson (BM25's author) and Zaragoza even describes adding (k1+1) to the numerator as a variant whose only benefit is to make scores more comparable with Robertson/Sparck-Jones weighting, which we don't care about:

      A common variant is to add a (k1 + 1) component to the
      numerator of the saturation function. This is the same for all
      terms, and therefore does not affect the ranking produced.
      The reason for including it was to make the final formula
      more compatible with the RSJ weight used on its own
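
      To make the ranking argument concrete, here is a minimal sketch (not the actual BM25Similarity code) that scores a two-term query with and without the (k1+1) factor and compares the resulting orderings. The class name, the helper methods, and the sample tf, IDF and length values are all made up for illustration; k1 = 1.2 and b = 0.75 are assumed as the usual defaults, and boost is taken to be 1.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration class; not part of Lucene.
class Bm25NumeratorSketch {
    static final double K1 = 1.2;  // assumed default
    static final double B = 0.75;  // assumed default

    // Sum over query terms of IDF * [(k1+1) *] tf / (tf + norm), with boost = 1.
    static double score(double[] idf, double[] tf, double docLen,
                        double avgDocLen, boolean keepK1PlusOne) {
        // Length normalization: k1 * (1 - b + b * docLen / avgDocLen)
        double norm = K1 * (1 - B + B * docLen / avgDocLen);
        double factor = keepK1PlusOne ? (K1 + 1) : 1.0;
        double sum = 0;
        for (int t = 0; t < idf.length; t++) {
            sum += idf[t] * factor * tf[t] / (tf[t] + norm);
        }
        return sum;
    }

    // Document indices sorted by descending score.
    static Integer[] ranking(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        double[] idf = {3.0, 1.0};                    // made-up IDFs for two terms
        double[][] tf = {{1, 5}, {3, 0}, {2, 2}};     // made-up term frequencies per doc
        double[] docLen = {100, 50, 200};             // made-up document lengths
        double avgDocLen = 100;

        double[] with = new double[tf.length];
        double[] without = new double[tf.length];
        for (int d = 0; d < tf.length; d++) {
            with[d] = score(idf, tf[d], docLen[d], avgDocLen, true);
            without[d] = score(idf, tf[d], docLen[d], avgDocLen, false);
        }
        System.out.println("ranking with (k1+1):    " + Arrays.toString(ranking(with)));
        System.out.println("ranking without (k1+1): " + Arrays.toString(ranking(without)));
    }
}
```

      Every document's score is scaled by the same constant, so both rankings come out identical; only the absolute score values change.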

      Should we remove it from BM25Similarity as well?

      A side effect that I'm interested in is that integrating other score contributions (e.g. via oal.document.FeatureField) would be a bit easier to reason about. For instance, a weight of 3 in FeatureField#newSaturationQuery would have a similar impact to a term whose IDF is 3 (and thus docFreq ~= 5% of documents) rather than a term whose IDF is 3/(k1 + 1).
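
      The arithmetic behind that comparison can be sketched as follows. This assumes the IDF form ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) and a saturation function of the form weight * S / (S + pivot); the class and method names are hypothetical, and the index size of 1,000,000 documents is made up.

```java
// Hypothetical illustration class; not part of Lucene.
class IdfSaturationSketch {
    // Assumed BM25 IDF form: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Upper bound of IDF * [(k1+1) *] tf / (tf + norm) as tf grows (boost = 1).
    static double maxTermScore(double idf, double k1, boolean keepK1PlusOne) {
        return keepK1PlusOne ? idf * (k1 + 1) : idf;
    }

    // Assumed saturation form for a feature value: weight * value / (value + pivot).
    static double saturation(double weight, double value, double pivot) {
        return weight * value / (value + pivot);
    }

    public static void main(String[] args) {
        long docCount = 1_000_000;          // made-up index size
        long docFreq = docCount / 20;       // term present in 5% of documents
        double k1 = 1.2;                    // assumed default

        double termIdf = idf(docFreq, docCount);
        System.out.println("IDF at docFreq = 5%: " + termIdf);
        System.out.println("max term score with (k1+1):    " + maxTermScore(termIdf, k1, true));
        System.out.println("max term score without (k1+1): " + maxTermScore(termIdf, k1, false));
        // A feature with weight 3 is bounded by 3 and reaches half of it at the pivot.
        System.out.println("feature score at pivot: " + saturation(3, 10, 10));
    }
}
```

      Without the (k1+1) factor, the term's upper bound is simply its IDF, so it sits on the same scale as the feature weight; with the factor, the same term can contribute up to (k1+1) times as much.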

              People

              • Assignee: Unassigned
              • Reporter: Adrien Grand (jpountz)
              • Votes: 1
              • Watchers: 9

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 1h