Lucene - Core / LUCENE-8020

Don't force sim to score bogus terms (e.g. docfreq=0)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Today all sim formulas have to be "hacked" to cope with being passed stats such as docFreq=0 or totalTermFreq=0. This happens easily with spans, and there is even a dedicated test for it. All formulas have hacks such as the one in https://issues.apache.org/jira/browse/LUCENE-6818:

      Instead of:

      expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
      

      they must do tricks such as:

      expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
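
To make the difference between the two formulas concrete, here is a minimal sketch in plain Java. It does not use Lucene: the `Stats` class and the `plain`/`smoothed` method names are hypothetical stand-ins for the statistics object and formula shown above, and the all-zero input is an assumed example of "bogus" statistics.

```java
public final class ExpectedFreqDemo {

    // Hypothetical stand-in for the stats a sim receives; not a Lucene class.
    static final class Stats {
        final long totalTermFreq;
        final long numberOfFieldTokens;

        Stats(long totalTermFreq, long numberOfFieldTokens) {
            this.totalTermFreq = totalTermFreq;
            this.numberOfFieldTokens = numberOfFieldTokens;
        }
    }

    // The formula as it would be written if the stats were always meaningful.
    static double plain(Stats stats, double docLen) {
        return stats.totalTermFreq * docLen / stats.numberOfFieldTokens;
    }

    // The "+1" workaround: never divides by zero, but changes every result slightly.
    static double smoothed(Stats stats, double docLen) {
        return (1 + stats.totalTermFreq) * docLen / (1 + stats.numberOfFieldTokens);
    }

    public static void main(String[] args) {
        Stats bogus = new Stats(0, 0);    // assumed example of bogus statistics
        Stats real = new Stats(50, 1000); // a term that actually occurs

        System.out.println(plain(bogus, 10.0));    // NaN (0.0 / 0.0)
        System.out.println(smoothed(bogus, 10.0)); // 10.0
        System.out.println(plain(real, 10.0));     // 0.5
        System.out.println(smoothed(real, 10.0));  // close to, but not exactly, 0.5
    }
}
```

The last line is the cost of the workaround: even for a term with valid statistics, the smoothed formula no longer returns what the unmodified formula would.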
      

      There is no good reason for this; it is just sloppiness in the Query/Weight/Scorer API. I think formulas should work unmodified: we shouldn't pass terms that don't exist, or bogus statistics.
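
One way to picture the proposal is a guard that sits in front of the formula, so the formula itself stays unmodified. The sketch below is only an illustration under assumptions, not the attached patch and not Lucene's API: `TermStats`, `expected`, and `score` are hypothetical names, and returning an empty `OptionalDouble` stands in for "this term is never handed to the sim".

```java
import java.util.OptionalDouble;

public final class SkipBogusTerms {

    // Hypothetical stats holder; not a Lucene class.
    static final class TermStats {
        final long docFreq;
        final long totalTermFreq;
        final long numberOfFieldTokens;

        TermStats(long docFreq, long totalTermFreq, long numberOfFieldTokens) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
            this.numberOfFieldTokens = numberOfFieldTokens;
        }
    }

    // The unmodified formula: no "+1" tricks, valid only for terms that exist.
    static double expected(TermStats stats, double docLen) {
        return stats.totalTermFreq * docLen / stats.numberOfFieldTokens;
    }

    // Guard applied by the caller: a term with no occurrences is skipped
    // before the formula ever sees its statistics.
    static OptionalDouble score(TermStats stats, double docLen) {
        if (stats.docFreq <= 0 || stats.totalTermFreq <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(expected(stats, docLen));
    }

    public static void main(String[] args) {
        TermStats missing = new TermStats(0, 0, 1000);
        TermStats present = new TermStats(5, 50, 1000);

        System.out.println(score(missing, 10.0).isPresent());   // false
        System.out.println(score(present, 10.0).getAsDouble()); // 0.5
    }
}
```

With the check in the caller, a new sim only has to implement `expected` for well-formed input.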

      These hacks add a lot of complexity to the scoring API and make it difficult to have meaningful/useful explanations, to debug problems, etc. They also make it really hard to add a new sim.

      Attachments

        1. LUCENE-8020.patch
          28 kB
          Robert Muir
        2. LUCENE-8020.patch
          24 kB
          Robert Muir
        3. LUCENE-8020.patch
          21 kB
          Robert Muir

          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)
            Votes: 0
            Watchers: 2
