[LUCENE-10593] VectorSimilarityFunction reverse removal - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.3
Component/s: None
Labels:
- vector-based-search

Lucene Fields: New

Description

org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN behaves in the opposite way to the other similarities:
a higher similarity value means a greater distance. For this reason it has been marked as "reversed", and a function is provided to map the similarity to a score (where higher means closer, as in all the other similarities).
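
To make the two conventions concrete, here is a minimal Java sketch. It is not the Lucene source: the class and method names are hypothetical, and the 1 / (1 + squared distance) mapping is an assumption about the distance-to-score formula this description refers to.

```java
/** Illustrative sketch only; hypothetical names, not the Lucene implementation. */
final class EuclideanScoreSketch {

  /** "Reversed" convention: squared Euclidean distance, where a HIGHER value means FURTHER apart. */
  static float squareDistance(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ: " + a.length + " vs " + b.length);
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /**
   * Score convention: maps the distance into (0, 1], where a HIGHER value means CLOSER.
   * The 1 / (1 + d) mapping is an assumption made for this sketch.
   */
  static float score(float[] a, float[] b) {
    return 1f / (1f + squareDistance(a, b));
  }

  public static void main(String[] args) {
    float[] query = {0f, 0f};
    float[] near = {1f, 0f}; // squared distance 1
    float[] far = {3f, 0f};  // squared distance 9
    System.out.println(score(query, near)); // prints 0.5
    System.out.println(score(query, far));  // prints 0.1
  }
}
```

With the score convention, the nearer vector gets the larger value, so callers can compare results from any similarity function in the same direction without consulting a "reversed" marker.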

I could not find any explanation for this counterintuitive behavior (please correct me if I am wrong), and it has a lot of nasty side effects on code readability, especially when combined with NeighborQueue, which has its own "reversed".
It also complicates the use of the following pattern in the HNSW searchers:
Result Queue -> MIN HEAP
Candidate Queue -> MAX HEAP
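
The queue pattern above can be sketched with java.util.PriorityQueue. This is a simplified best-first graph search, not Lucene's HNSW code or its NeighborQueue: the class name, the adjacency-array graph and the precomputed scores array are all assumptions made for illustration. It relies on scores where higher means closer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Illustrative sketch only; hypothetical names, not the Lucene implementation. */
final class HeapPatternSketch {

  /**
   * Returns up to topK node ids reachable from entryPoint, best score first.
   * neighbors[i] lists the nodes adjacent to node i; scores[i] is the score of
   * node i against the query (higher means closer).
   */
  static List<Integer> search(int[][] neighbors, float[] scores, int entryPoint, int topK) {
    if (topK < 1) {
      throw new IllegalArgumentException("topK must be at least 1");
    }
    Comparator<Integer> byScore = Comparator.comparingDouble(id -> scores[id]);

    // Result queue -> MIN HEAP: the worst kept result sits at the head, ready to be evicted.
    PriorityQueue<Integer> results = new PriorityQueue<>(byScore);
    // Candidate queue -> MAX HEAP: the most promising candidate sits at the head.
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byScore.reversed());
    Set<Integer> visited = new HashSet<>();

    candidates.add(entryPoint);
    results.add(entryPoint);
    visited.add(entryPoint);

    while (!candidates.isEmpty()) {
      int best = candidates.poll();
      // Stop once the best remaining candidate cannot beat the worst kept result.
      if (results.size() >= topK && scores[best] < scores[results.peek()]) {
        break;
      }
      for (int neighbor : neighbors[best]) {
        if (!visited.add(neighbor)) {
          continue; // already seen
        }
        if (results.size() < topK || scores[neighbor] > scores[results.peek()]) {
          candidates.add(neighbor);
          results.add(neighbor);
          if (results.size() > topK) {
            results.poll(); // evict the current worst result
          }
        }
      }
    }

    // Draining a min heap yields worst first, so insert at the front to get best first.
    List<Integer> ordered = new ArrayList<>();
    while (!results.isEmpty()) {
      ordered.add(0, results.poll());
    }
    return ordered;
  }
}
```

Because both queues order by the same "higher is closer" score, the only difference between them is the direction of the comparator. A similarity that returns a distance instead would force every comparison in the loop to be flipped.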

My Pull Request proposes to:

1) make the Euclidean similarity return the score directly, in line with the other similarities, using the formula currently used to convert distance to score

2) simplify the code by removing the bound checker, which is no longer necessary

3) refactor the affected code to match the simplification

4) refactor NeighborQueue to state clearly whether it is a MIN_HEAP or a MAX_HEAP, which makes debugging much easier and the HNSW code much more intuitive to understand

People

Assignee: Alessandro Benedetti

Reporter: Alessandro Benedetti

Votes: 0

Watchers: 3

Dates

Created: 26/May/22 20:57

Updated: 27/Sep/22 09:16

Resolved: 29/Jun/22 09:39