In LUCENE-9322 we decided that the new vectors API shouldn’t assume a particular nearest-neighbor search data structure and algorithm. This flexibility is important since NN search is a developing area and we'd like to be able to experiment with and evolve the algorithm. Right now we only have one algorithm (HNSW), but we want to maintain the ability to use another.
Currently the algorithm to use is specified through SearchStrategy, for example SearchStrategy.EUCLIDEAN_HNSW. This means a single format implementation is expected to handle multiple algorithms. Instead we could have one format implementation per algorithm. Our current implementation would be HNSW-specific, with a name like HnswVectorFormat, and to experiment with another algorithm you could create a new implementation like ClusterVectorFormat. This would be better aligned with the codec framework, and would help avoid exposing algorithm details in the API.
A concrete proposal (note many of these names will change when LUCENE-9855 is addressed):
- Rename Lucene90VectorFormat to Lucene90HnswVectorFormat. Also add HNSW to the names of Lucene90VectorWriter and Lucene90VectorReader.
- Remove references to HNSW in SearchStrategy, so there is just SearchStrategy.EUCLIDEAN, etc. Rename SearchStrategy to something like SimilarityFunction.
- Remove FieldType attributes related to HNSW parameters (maxConn and beamWidth). Instead make these arguments to Lucene90HnswVectorFormat.
- Introduce PerFieldVectorFormat to allow a different NN approach or parameters to be configured per-field (?)
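To make the shape of this proposal easier to discuss, here is a rough, self-contained sketch of how the pieces could fit together. It does not use the real Lucene classes: the VectorFormat interface, describe(), register(), formatForField(), the numClusters parameter, the DOT_PRODUCT constant and the field names are all assumptions made up for illustration, and the final API may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical replacement for SearchStrategy: it names only the similarity
// measure and carries no reference to HNSW. DOT_PRODUCT is illustrative.
enum SimilarityFunction {
  EUCLIDEAN,
  DOT_PRODUCT
}

// Hypothetical, heavily simplified stand-in for the codec extension point.
// The idea is one implementation per NN algorithm.
interface VectorFormat {
  String describe();
}

// HNSW-specific format: maxConn and beamWidth move from FieldType to here.
final class HnswVectorFormat implements VectorFormat {
  final int maxConn;
  final int beamWidth;

  HnswVectorFormat(int maxConn, int beamWidth) {
    if (maxConn <= 0 || beamWidth <= 0) {
      throw new IllegalArgumentException("maxConn and beamWidth must be positive");
    }
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  @Override
  public String describe() {
    return "HnswVectorFormat(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
  }
}

// A second, experimental algorithm plugs in as its own format.
// numClusters is an invented parameter, not part of the proposal.
final class ClusterVectorFormat implements VectorFormat {
  final int numClusters;

  ClusterVectorFormat(int numClusters) {
    this.numClusters = numClusters;
  }

  @Override
  public String describe() {
    return "ClusterVectorFormat(numClusters=" + numClusters + ")";
  }
}

// Per-field dispatch: fields without an explicit entry use the default format.
final class PerFieldVectorFormat implements VectorFormat {
  private final Map<String, VectorFormat> byField = new HashMap<>();
  private final VectorFormat defaultFormat;

  PerFieldVectorFormat(VectorFormat defaultFormat) {
    this.defaultFormat = defaultFormat;
  }

  PerFieldVectorFormat register(String field, VectorFormat format) {
    byField.put(field, format);
    return this;
  }

  VectorFormat formatForField(String field) {
    return byField.getOrDefault(field, defaultFormat);
  }

  @Override
  public String describe() {
    return "PerFieldVectorFormat(default=" + defaultFormat.describe() + ")";
  }
}

class VectorFormatSketch {
  public static void main(String[] args) {
    // The field only declares how vectors are compared, not how they are indexed.
    SimilarityFunction similarity = SimilarityFunction.EUCLIDEAN;

    PerFieldVectorFormat format =
        new PerFieldVectorFormat(new HnswVectorFormat(16, 100))
            .register("title_vector", new ClusterVectorFormat(256));

    System.out.println(similarity);
    System.out.println(format.formatForField("title_vector").describe());
    System.out.println(format.formatForField("body_vector").describe());
  }
}
```

In this sketch the field configuration would only need a dimension and a SimilarityFunction, while everything algorithm-specific lives in the format implementation chosen for that field.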
One note: the current HNSW-based format includes logic for storing a numeric vector per document, as well as for constructing and storing an HNSW graph. When adding another implementation, it’d be nice to be able to reuse the logic for reading/writing numeric vectors. I don’t think we need to design for this right now, but we can keep it in mind for the future?
This issue is based on a thread jpountz started: https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E