  1. Lucene - Core
  2. LUCENE-10375

Speed up HNSW merge by writing combined vector data

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.1
    • None
    • New

    Description

      When merging segments, the HNSW writer creates a VectorValues instance that gives a merged view of all the segments' VectorValues. This merged instance is used when constructing the new HNSW graph. Graph building needs random access, and the merged VectorValues supports this by mapping each merged ordinal to a segment and to an ordinal within that segment.
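
To make the mapping concrete, here is a minimal sketch of how a merged ordinal could be resolved to a segment and a segment-local ordinal with a binary search over segment start offsets. The class `MergedOrdinalMap` and its methods are hypothetical names for illustration, not Lucene's actual implementation; the sketch assumes every segment is non-empty and ignores deleted documents.

```java
import java.util.Arrays;

/** Hypothetical sketch: resolves a merged ordinal to (segment, segment ordinal). */
public final class MergedOrdinalMap {
    // starts[i] is the merged ordinal of the first vector in segment i.
    private final int[] starts;

    /** Assumes every entry of segmentSizes is greater than zero. */
    public MergedOrdinalMap(int[] segmentSizes) {
        starts = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            starts[i] = next;
            next += segmentSizes[i];
        }
    }

    /** Index of the segment that owns the given merged ordinal. */
    public int segmentOf(int mergedOrd) {
        int idx = Arrays.binarySearch(starts, mergedOrd);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the owning segment is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Ordinal of the vector within its owning segment. */
    public int segmentOrdinalOf(int mergedOrd) {
        return mergedOrd - starts[segmentOf(mergedOrd)];
    }
}
```

With segment sizes {3, 2, 4}, merged ordinal 4 resolves to segment 1, segment ordinal 1. Every random access during graph building pays for one such search, which is the kind of per-lookup cost the issue is concerned with.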

      This mapping seems to add overhead. The nightly indexing benchmarks sometimes show substantial time in Arrays.binarySearch (used to map an ordinal to a segment): https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.

      Instead of using a merged VectorValues to create the graph, maybe we could first write all the segment vectors to a single file, and then use that file to build the graph.
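
The proposal can be sketched as follows: copy every segment's vectors into one flat file in merged order, then serve random access by computing a byte offset from the merged ordinal, with no per-lookup search. This is an illustration using plain `java.nio`, not Lucene's `IndexOutput`/`IndexInput` or its codec format; the class `CombinedVectorFile`, the temp-file name, and the fixed-width big-endian float layout are all assumptions.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical sketch: all segment vectors written to one file, read back by ordinal. */
public final class CombinedVectorFile implements AutoCloseable {
    private final Path path;
    private final FileChannel channel;
    private final int dim;
    private final int size;

    /** Writes each segment's vectors back to back, in merged-ordinal order. */
    public CombinedVectorFile(List<float[][]> segments, int dim) {
        this.dim = dim;
        try {
            path = Files.createTempFile("combined-vectors", ".bin");
            channel = FileChannel.open(path, StandardOpenOption.READ, StandardOpenOption.WRITE);
            ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
            int count = 0;
            for (float[][] segment : segments) {
                for (float[] vector : segment) {
                    if (vector.length != dim) {
                        throw new IllegalArgumentException("unexpected vector dimension");
                    }
                    buf.clear();
                    buf.asFloatBuffer().put(vector); // fills the bytes; buf position stays 0
                    while (buf.hasRemaining()) {
                        channel.write(buf);
                    }
                    count++;
                }
            }
            size = count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int size() {
        return size;
    }

    /** Random access: the merged ordinal translates directly into a file offset. */
    public float[] vectorValue(int ord) {
        ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
        long offset = (long) ord * dim * Float.BYTES;
        try {
            while (buf.hasRemaining()) {
                if (channel.read(buf, offset + buf.position()) < 0) {
                    throw new EOFException("ordinal out of range: " + ord);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buf.flip();
        float[] out = new float[dim];
        buf.asFloatBuffer().get(out);
        return out;
    }

    @Override
    public void close() {
        try {
            channel.close();
            Files.deleteIfExists(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off in this sketch is an extra sequential write of all vectors up front in exchange for constant-time offset arithmetic on each lookup during graph construction.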

          People

            Assignee: Unassigned
            Reporter: Julie Tibshirani (julietibs)

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 7h 10m