Lucene - Core / LUCENE-8011

Improve similarity explanations

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • Fix Version/s: 8.0
    • None
    • New

    Description

      LUCENE-7997 improved the explain output of the BM25 and Classic similarities so that every value is labeled with what it means and how it was computed:

      product of:
        2.2 = scaling factor, k1 + 1
        9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
          1.0 = n, number of documents containing term
          17927.0 = N, total number of documents with field
        0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
          979.0 = freq, occurrences of term within document
          1.2 = k1, term saturation parameter
          0.75 = b, length normalization parameter
          1.0 = dl, length of field
          1.0 = avgdl, average length of field
      
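      The formulas quoted in the explanation above can be checked by hand. The following is a minimal sketch that recomputes the idf and tf values from the listed inputs using plain double arithmetic; the class and method names (Bm25ExplainSketch, idf, tf, score) are hypothetical and are not part of Lucene, which uses its own float-based implementation, so the last digits may differ slightly.

```java
// Sketch only: recomputes the values shown in the explanation above.
// Bm25ExplainSketch and its methods are hypothetical names, not Lucene API.
public class Bm25ExplainSketch {

    // idf = log(1 + (N - n + 0.5) / (n + 0.5))
    // n = number of documents containing the term
    // N = total number of documents with the field
    public static double idf(double n, double N) {
        return Math.log(1 + (N - n + 0.5) / (n + 0.5));
    }

    // tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    public static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    // Product of the three top-level factors: (k1 + 1) * idf * tf
    public static double score(double n, double N, double freq,
                               double k1, double b, double dl, double avgdl) {
        return (k1 + 1) * idf(n, N) * tf(freq, k1, b, dl, avgdl);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;
        // Inputs copied from the explanation output above.
        System.out.println("idf   = " + idf(1, 17927));
        System.out.println("tf    = " + tf(979, k1, b, 1, 1));
        System.out.println("score = " + score(1, 17927, 979, k1, b, 1, 1));
    }
}
```

      Running it should reproduce the idf and tf lines of the explanation to within rounding, which is what makes the new format practical: each number can be traced back to the inputs printed beneath it.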

      Previously it was pretty cryptic and used confusing terminology like docCount/docFreq without explanation:

      product of:
        0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          449.0 = docFreq
          456.0 = docCount
        2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
          113659.0 = freq=113658
          1.2 = parameter k1
          0.75 = parameter b
          2300.5593 = avgFieldLength
          1048600.0 = fieldLength
      
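      Comparing the two formats, the old tfNorm and the new tf appear to differ only by the constant factor k1 + 1, which the new output reports separately as the scaling factor. A short derivation, writing dl for fieldLength and avgdl for avgFieldLength to match the new names:

```latex
\[
\mathrm{tfNorm}
  = \frac{\mathrm{freq}\,(k_1 + 1)}
         {\mathrm{freq} + k_1\left(1 - b + b\,\frac{dl}{avgdl}\right)}
  = (k_1 + 1)\cdot
    \underbrace{\frac{\mathrm{freq}}
         {\mathrm{freq} + k_1\left(1 - b + b\,\frac{dl}{avgdl}\right)}}_{\mathrm{tf}}
\]
```

      So the overall product is unchanged by the reformatting; the new explanation only splits it into more readable pieces.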

      We should fix the other similarities in the same way, so that their explanations are just as practical.

            People

              Assignee: Unassigned
              Reporter: Robert Muir (rcmuir)
              Votes: 0
              Watchers: 4