Lucene - Core / LUCENE-8020

Don't force sim to score bogus terms (e.g. docfreq=0)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today every sim formula has to be "hacked" to cope with being passed stats such as docFreq=0 or totalTermFreq=0. This happens easily with spans, and there is even a dedicated test for it. All formulas contain workarounds like the one in https://issues.apache.org/jira/browse/LUCENE-6818:

      Instead of:

      expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
      

      they must do tricks such as:

      expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
      

      There is no good reason for this; it is just sloppiness in the Query/Weight/Scorer API. I think formulas should work unmodified: we shouldn't pass terms that don't exist, or bogus statistics, to the sim.
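To illustrate the idea, here is a minimal sketch in which the caller filters out nonexistent terms so the formula can stay unmodified. It is not the attached patch and not Lucene's actual API: the class `BogusTermSketch`, the `Stats` holder, and the method `expectedIfTermExists` are hypothetical names made up for this example.

```java
import java.util.OptionalDouble;

public class BogusTermSketch {

    /** Hypothetical stand-in for collection statistics; not Lucene's real stats class. */
    public static final class Stats {
        final long docFreq;
        final long totalTermFreq;
        final long numberOfFieldTokens;

        public Stats(long docFreq, long totalTermFreq, long numberOfFieldTokens) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
            this.numberOfFieldTokens = numberOfFieldTokens;
        }
    }

    /** The formula from the description, written without any +1 adjustment. */
    public static double expected(Stats stats, double docLen) {
        return stats.totalTermFreq * docLen / stats.numberOfFieldTokens;
    }

    /**
     * Assumed caller-side guard: a term that does not exist is never handed
     * to the formula, so the formula needs no special cases.
     */
    public static OptionalDouble expectedIfTermExists(Stats stats, double docLen) {
        if (stats.docFreq <= 0 || stats.totalTermFreq <= 0) {
            return OptionalDouble.empty(); // nothing to score
        }
        return OptionalDouble.of(expected(stats, docLen));
    }
}
```

With real statistics the unmodified formula behaves as intended. With all-zero statistics it evaluates to NaN (0.0 / 0.0 in floating point), which is the kind of value the +1 workaround exists to avoid; the guard sidesteps it by not calling the formula at all.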

      These workarounds add a lot of complexity to the scoring API and make it difficult to produce meaningful, useful explanations or to debug problems. They also make it really hard to add a new sim.

        Attachments

        1. LUCENE-8020.patch
          28 kB
          Robert Muir
        2. LUCENE-8020.patch
          24 kB
          Robert Muir
        3. LUCENE-8020.patch
          21 kB
          Robert Muir

            People

            • Assignee:
              Unassigned
            • Reporter:
              rcmuir Robert Muir