Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Lucene Fields: New
Description
Since Lucene 4.0, the statistics needed for BM25 are always present, so we can make it the default similarity without any degradation.
I think this should be a 6.0-only change, which will prevent any surprises. DefaultSimilarity is renamed to ClassicSimilarity to prevent confusion. No indexing change is needed since we use the same norm format; it's just a runtime switch. Users can call IndexSearcher.setSimilarity(new ClassicSimilarity()) to get the old behavior. I did not change Solr's default here. I think that should be a separate issue, since it raises more concerns, e.g. factories in configuration files and so on.
One issue was the generation of synonym queries (posinc=0) by QueryBuilder (used by parsers). This is kind of a corner case (query-time synonyms), but we should make it nicer. The current code in trunk disables coord, which makes no sense for anything but the vector space impl. Instead, this patch adds a SynonymQuery, which treats occurrences of any of its terms as a single pseudoterm. With English WordNet as a query-time synonym dict, this query gives a 12% improvement in MAP for title queries on BM25, and 2% with Classic (not significant). So it's a better generic approach for synonyms that works with all scoring models.
I wanted to use BlendedTermQuery, but at a glance it seems to have problems: it tries to "take on the world", and it does not work with distributed scoring (it doesn't consult IndexSearcher for stats). Anyway, this one is a different, simpler approach, which only works for a single field and calls tf(sum) a single time.
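To make the "single pseudoterm, tf(sum) called once" idea concrete, here is a minimal toy sketch in plain Java. It is not the SynonymQuery code from the patch: the class and method names are hypothetical, idf is left out entirely, and the BM25 parameters k1 = 1.2 and b = 0.75 are simply the commonly used defaults. It only contrasts applying the tf saturation once to the summed frequencies with applying it once per synonym and adding the results.

```java
import java.util.Arrays;

/** Toy illustration only; hypothetical names, not Lucene's SynonymQuery. */
final class PseudoTermSketch {
    // Assumed BM25 parameters (the commonly used defaults).
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** BM25 term-frequency saturation for one (pseudo)term in one document. */
    static double bm25Tf(double freq, double docLen, double avgDocLen) {
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return freq * (K1 + 1.0) / (freq + norm);
    }

    /** Synonyms as one pseudoterm: sum the frequencies, then call tf once. */
    static double scoreAsPseudoTerm(int[] synonymFreqs, double docLen, double avgDocLen) {
        int sum = Arrays.stream(synonymFreqs).sum();
        return bm25Tf(sum, docLen, avgDocLen);
    }

    /** Synonyms as independent terms: call tf per term and add the results. */
    static double scoreAsSeparateTerms(int[] synonymFreqs, double docLen, double avgDocLen) {
        double total = 0.0;
        for (int freq : synonymFreqs) {
            total += bm25Tf(freq, docLen, avgDocLen);
        }
        return total;
    }

    public static void main(String[] args) {
        // One document of average length in which two synonyms each occur once.
        int[] freqs = {1, 1};
        System.out.println("pseudoterm:     " + scoreAsPseudoTerm(freqs, 10, 10));
        System.out.println("separate terms: " + scoreAsSeparateTerms(freqs, 10, 10));
    }
}
```

In this toy setup the pseudoterm variant scores the document as if one term occurred twice (about 1.375), while the per-term variant counts two full single-occurrence matches (about 2.0). That is the sense in which a document using several synonyms of the same word is not over-rewarded when the similarity is invoked only once.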
Attachments
Issue Links
- is related to
  - SOLR-8057 Change default Sim to BM25 (w/backcompat config handling) (Resolved)
  - LUCENE-6887 5x: backport ClassicSimilarity, mark DefaultSimilarity deprecated & update javadocs to mention ClassicSim vs. BM25Sim (Closed)