Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-9937

ann-benchmarks results for HNSW search

Details

    • Task
    • Status: Resolved
    • Minor
    • Resolution: Done
    • None
    • None
    • None
    • None
    • New

    Description

      NOTE: These benchmarks contained an error that made hnswlib perform too well. See this comment for correct results.

      I hooked up our HNSW implementation to ann-benchmarks, a widely used repo for benchmarking nearest neighbor search libraries against large datasets. I found the results interesting and opened this issue to share and discuss. My benchmarking code can be found here – it's hacky and not easy to commit but I’m happy to help anyone get set up with it.

      Approaches

      • LuceneVectorsOnly: a baseline that only indexes vectors, and performs a brute force scan to determine nearest neighbors
      • LuceneHnsw: our HNSW implementation
      • hnswlib: a C++ HNSW implementation from the author of the paper

      Datasets

      • sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing euclidean distance
      • glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing cosine similarity

      Results on sift-128-euclidean
      Parameters: M=16, efConstruction=500

      Approach                Recall  QPS
      LuceneVectorsOnly()     1.000      6.764
      
      LuceneHnsw(n_cands=10)  0.603   7736.968
      LuceneHnsw(n_cands=50)  0.890   3605.000
      LuceneHnsw(n_cands=100) 0.953   2237.429
      LuceneHnsw(n_cands=500) 0.996    570.900
      LuceneHnsw(n_cands=800) 0.998    379.589
      
      hnswlib(n_cands=10)     0.713  69662.756
      hnswlib(n_cands=50)     0.950  28021.582
      hnswlib(n_cands=100)    0.985  16108.538
      hnswlib(n_cands=500)    1.000   4115.886
      hnswlib(n_cands=800)    1.000   2729.680
      

      Results on glove-100-angular
      Parameters: M=32, efConstruction=500

      Approach                Recall  QPS
      LuceneVectorsOnly()     1.000      6.722
      
      LuceneHnsw(n_cands=10)  0.507   5036.236
      LuceneHnsw(n_cands=50)  0.760   2099.850
      LuceneHnsw(n_cands=100) 0.833   1233.690
      LuceneHnsw(n_cands=500) 0.941    309.077
      LuceneHnsw(n_cands=800) 0.961    203.782
      
      hnswlib(n_cands=10)     0.597  43543.345
      hnswlib(n_cands=50)     0.832  14719.165
      hnswlib(n_cands=100)    0.897   8299.948
      hnswlib(n_cands=500)    0.981   1931.985
      hnswlib(n_cands=800)    0.991    881.752
      

      Notes on benchmark:

      • By default, the ann-benchmarks repo retrieves 10 nearest neighbors and computes the recall against the true neighbors. The recall calculation has a small 'fudge factor' that allows neighbors that are within a small epsilon of the best distance. Queries are executed serially to obtain the QPS.
      • I chose parameters where hnswlib performed well, then passed these same parameters to Lucene HNSW. For index-time parameters, I set maxConn as M and beamWidth as efConstruction. For search parameters, I set k to k, and fanout as (num_cands - k) so that the beam search is of size num_cands. Note that our default value for beamWidth is 16, which is really low – I wasn’t able to obtain acceptable recall until I bumped it to closer to 500 to match the hnswlib default.
      • I force-merged to one segment before running searches since this gives the best recall + QPS, and also to match hnswlib.

      Some observations:

      • It'd be really nice to extend luceneutil to measure vector search recall in addition to latency. That would help ensure we’re benchmarking a more realistic scenario, instead of accidentally indexing/ searching at a very low recall. Tracking recall would also guard against subtle, unintentional changes to the algorithm. It's easy to make an accidental change while refactoring, and with approximate algorithms, unit tests don't guard as well against this.
      • Lucene HNSW gives a great speed-up over the baseline without sacrificing too much recall. But it doesn't perform as well as hnswlib in terms of both recall and QPS. We wouldn’t expect the results to line up perfectly, since Lucene doesn't actually implement HNSW – the current algorithm isn't actually hierarchical and only uses a single graph layer. Does this difference indicate we're leaving performance 'on the table' by not using layers, which (I don't think) adds much index time or space? Or are there other algorithmic improvements would help close the gap?
      • Setting beamWidth to 500 really slowed down indexing. I'll open a separate issue with indexing speed results, keeping this one focused on search.

      Attachments

        Activity

          People

            Unassigned Unassigned
            julietibs Julie Tibshirani
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment