[LUCENE-8020] Don't force sim to score bogus terms (e.g. docfreq=0) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 8.0
Component/s: None
Labels: None

Lucene Fields: New

Description

Today all similarity formulas have to be "hacked" to cope with being passed statistics such as docFreq=0 or totalTermFreq=0. This happens easily with spans, and there is even a dedicated test for it. All formulas contain hacks such as the one you see in https://issues.apache.org/jira/browse/LUCENE-6818:

Instead of:

expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();

they must do tricks such as:

expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
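To make the difference concrete, here is a minimal sketch of both variants evaluated on bogus statistics. The class name `ExpectedFreqDemo`, the method names, and the input values are hypothetical and chosen only for illustration; this is not code from Lucene or from the attached patch.

```java
public class ExpectedFreqDemo {

    // The unmodified formula from the description:
    // totalTermFreq * docLen / numberOfFieldTokens
    public static double plain(long totalTermFreq, double docLen, long numberOfFieldTokens) {
        return totalTermFreq * docLen / numberOfFieldTokens;
    }

    // The "+1" workaround from the description, which keeps the
    // result finite and non-zero even for all-zero statistics.
    public static double smoothed(long totalTermFreq, double docLen, long numberOfFieldTokens) {
        return (1 + totalTermFreq) * docLen / (1 + numberOfFieldTokens);
    }

    public static void main(String[] args) {
        double docLen = 10.0; // hypothetical document length

        // Real statistics: both variants give a usable value.
        System.out.println(plain(5, docLen, 100));
        System.out.println(smoothed(5, docLen, 100));

        // Bogus statistics, everything zero: 0.0 / 0 is NaN in floating point.
        System.out.println(Double.isNaN(plain(0, docLen, 0)));

        // Bogus term in a non-empty field: the expected frequency is 0,
        // so any later logarithm of it is negative infinity.
        System.out.println(Math.log(plain(0, docLen, 100)));

        // The workaround hides both problems.
        System.out.println(smoothed(0, docLen, 0));
    }
}
```

With valid statistics the plain formula behaves well; it only misbehaves when it is handed a term that does not exist, which is the case the issue argues should never reach the similarity.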

There is no good reason for these workarounds; they exist only because of sloppiness in the Query/Weight/Scorer API. I think formulas should work unmodified: we shouldn't pass terms that don't exist, or bogus statistics.

It adds a lot of complexity to the scoring api and makes it difficult to have meaningful/useful explanations, to debug problems, etc. It also makes it really hard to add a new sim.
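One way to picture the proposal is a caller that drops non-existent terms before any similarity formula sees them. The sketch below is an assumption about the general shape of such a guard, not the change made in the attached patch; `SkipBogusTerms`, `TermStats`, and `scorable` are hypothetical names and do not correspond to Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipBogusTerms {

    // Hypothetical holder for per-term statistics (not a Lucene class).
    public static final class TermStats {
        public final String term;
        public final long docFreq;
        public final long totalTermFreq;

        public TermStats(String term, long docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // Keep only terms that actually occur in the index, so that a
    // similarity formula downstream never receives zero statistics.
    public static List<TermStats> scorable(List<TermStats> candidates) {
        List<TermStats> result = new ArrayList<>();
        for (TermStats stats : candidates) {
            if (stats.docFreq > 0 && stats.totalTermFreq > 0) {
                result.add(stats);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TermStats> candidates = new ArrayList<>();
        candidates.add(new TermStats("lucene", 3, 7));  // term exists in the index
        candidates.add(new TermStats("missing", 0, 0)); // term does not exist
        for (TermStats stats : scorable(candidates)) {
            System.out.println(stats.term);
        }
    }
}
```

Filtering at the call site keeps each formula in its textbook form, which is what the description asks for.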

Attachments

- LUCENE-8020.patch, 28/Oct/17 19:50, 28 kB, Robert Muir
- LUCENE-8020.patch, 28/Oct/17 18:58, 24 kB, Robert Muir
- LUCENE-8020.patch, 28/Oct/17 18:15, 21 kB, Robert Muir

People

Assignee: Unassigned

Reporter: Robert Muir

Votes: 0

Watchers: 2

Dates

Created: 28/Oct/17 17:57

Updated: 28/Aug/22 15:20

Resolved: 31/Oct/17 00:39