Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
-
New
Description
When indexing new documents in an HNSW graph, we first find its nearest maxConn neighbors (using HNSW search), and then link the new document to this neighbors in the graph. These neighbors are filtered using a diversity test. The neighbors are added one by one, from most similar to least. Each new neighbor is checked against all prior (better) neighbors, and if it is more similar to that neighbor than it is to the target document, it is rejected as insufficiently diverse.
When we applied this diversity criterion (rather than simply picking the k nearest neighbors), we saw substantial improvements in recall / latency ROC curves across several data sets, and it is part of the reference implementation, too (where we got it). I believe the impact on indexing performance was relatively small; this is a good thing to do, even though it is n^2 at its heart, the n remains reasonable due to being bounded by the maximum graph fanout parameter, maxConn.
Something funny happens when we reach the maximum fanout though. While a new document is being linked to its new neighbors, the neighbors are reciprocally linked to the new document, until their maximum fanout is reached. At that point, the diversity criterion is reapplied to select the neighbors to keep. Basically every neighbor is re-checked against every earlier (better) neighbor to verify the diversity criterion. This is needed because we haven't really maintained the diversity property while adding these reciprocal links – the initial neighbors are checked for diversity, which often leads to fewer than maxConn of them being added. Then the new documents get linked in without checking, until maxConn is reached, and then diversity is checked again. This is kind of weird, but seems to work.
But the really strange thing is that when we reject non-diverse documents (in HnswGraphBuilder.diversityUpdate), the neighbors are no longer sorted in nearness order. I did some rough checks to see if better graphs would result from re-sorting (so that when there are non-diverse neighbors, we always prefer to drop the worse-scoring one), but it didn't seem to matter all that much. But how can that be?
At any rate this code is funky and hard to understand, and it would probably benefit from a second look to see if we can either improve indexing performance or improve search performance (by producing better graphs during indexing).
Attachments
Issue Links
- links to