[LUCENE-9848] Correctly sort HNSW graph neighbors when applying diversity criterion - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.2
Component/s: None
Labels:
None

Lucene Fields:

New

Description

When indexing new documents in an HNSW graph, we first find its nearest maxConn neighbors (using HNSW search), and then link the new document to this neighbors in the graph. These neighbors are filtered using a diversity test. The neighbors are added one by one, from most similar to least. Each new neighbor is checked against all prior (better) neighbors, and if it is more similar to that neighbor than it is to the target document, it is rejected as insufficiently diverse.

When we applied this diversity criterion (rather than simply picking the k nearest neighbors), we saw substantial improvements in recall / latency ROC curves across several data sets, and it is part of the reference implementation, too (where we got it). I believe the impact on indexing performance was relatively small; this is a good thing to do, even though it is n^2 at its heart, the n remains reasonable due to being bounded by the maximum graph fanout parameter, maxConn.

Something funny happens when we reach the maximum fanout though. While a new document is being linked to its new neighbors, the neighbors are reciprocally linked to the new document, until their maximum fanout is reached. At that point, the diversity criterion is reapplied to select the neighbors to keep. Basically every neighbor is re-checked against every earlier (better) neighbor to verify the diversity criterion. This is needed because we haven't really maintained the diversity property while adding these reciprocal links – the initial neighbors are checked for diversity, which often leads to fewer than maxConn of them being added. Then the new documents get linked in without checking, until maxConn is reached, and then diversity is checked again. This is kind of weird, but seems to work.

But the really strange thing is that when we reject non-diverse documents (in HnswGraphBuilder.diversityUpdate), the neighbors are no longer sorted in nearness order. I did some rough checks to see if better graphs would result from re-sorting (so that when there are non-diverse neighbors, we always prefer to drop the worse-scoring one), but it didn't seem to matter all that much. But how can that be?

At any rate this code is funky and hard to understand, and it would probably benefit from a second look to see if we can either improve indexing performance or improve search performance (by producing better graphs during indexing).

Attachments

Issue Links

links to

GitHub Pull Request #862

Activity

People

Assignee:: Mayya Sharipova

Reporter:: Michael Sokolov

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 15/Mar/21 21:57

Updated:: 28/Aug/22 16:18

Resolved:: 04/May/22 18:33

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1.5h