Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
-
New
Description
Today all sim formulas have to be "hacked" to deal with the fact that they may be passed stats such as docFreq=0, totalTermFreq=0. This happens easily with spans and there is even a dedicated test for it. All formulas have hacks such as what you see in https://issues.apache.org/jira/browse/LUCENE-6818:
Instead of:
expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
they must do tricks such as:
expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
There is no good reason for this, it is just sloppiness in the Query/Weight/Scorer api. I think formulas should work unmodified, we shouldn't pass terms that dont exist or bogus statistics.
It adds a lot of complexity to the scoring api and makes it difficult to have meaningful/useful explanations, to debug problems, etc. It also makes it really hard to add a new sim.