Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-9848

Correctly sort HNSW graph neighbors when applying diversity criterion

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 9.2
    • None
    • None
    • New

    Description

      When indexing new documents in an HNSW graph, we first find its nearest maxConn neighbors (using HNSW search), and then link the new document to this neighbors in the graph. These neighbors are filtered using a diversity test. The neighbors are added one by one, from most similar to least. Each new neighbor is checked against all prior (better) neighbors, and if it is more similar to that neighbor than it is to the target document, it is rejected as insufficiently diverse.

      When we applied this diversity criterion (rather than simply picking the k nearest neighbors), we saw substantial improvements in recall / latency ROC curves across several data sets, and it is part of the reference implementation, too (where we got it). I believe the impact on indexing performance was relatively small; this is a good thing to do, even though it is n^2 at its heart, the n remains reasonable due to being bounded by the maximum graph fanout parameter, maxConn

      Something funny happens when we reach the maximum fanout though. While a new document is being linked to its new neighbors, the neighbors are reciprocally linked to the new document, until their maximum fanout is reached. At that point, the diversity criterion is reapplied to select the neighbors to keep. Basically every neighbor is re-checked against every earlier (better) neighbor to verify the diversity criterion.  This is needed because we haven't really maintained the diversity property while adding these reciprocal links – the initial neighbors are checked for diversity, which often leads to fewer than maxConn of them being added. Then the new documents get linked in without checking, until maxConn is reached, and then diversity is checked again. This is kind of weird, but seems to work.

      But the really strange thing is that when we reject non-diverse documents (in HnswGraphBuilder.diversityUpdate), the neighbors are no longer sorted in nearness order. I did some rough checks to see if better graphs would result from re-sorting (so that when there are non-diverse neighbors, we always prefer to drop the worse-scoring one), but it didn't seem to matter all that much. But how can that be?

      At any rate this code is funky and hard to understand, and it would probably benefit from a second look to see if we can either improve indexing performance or improve search performance (by producing better graphs during indexing).

      Attachments

        Issue Links

          Activity

            People

              mayya Mayya Sharipova
              sokolov Michael Sokolov
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1.5h
                  1.5h