Affects Version/s: None
Fix Version/s: None
Search engines have traditionally relied upon token-based matching (typically keywords) on an inverted index, plus relevance ranking based upon keyword occurrence statistics. This can be viewed as a "sparse vector” match (where each term is a one-hot encoded dimension in the vector), since only a few keywords out of all possible keywords are considered in each query. With the introduction of deep-learning-based transformers over the last few years, however, the state of the art in relevance has moved to ranking models based upon dense vectors that encode a latent, semantic understanding of both language constructs and the underlying domain upon which the model was trained. These dense vectors are also referred to as “embeddings”. An example of this kind of embedding would be taking the phrase “chief executive officer of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]
. Other similar phrases should encode to vectors with very similar numbers, so we may expect a query like “CEO of a technology org” to generate a vector like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity calculation between these vectors, we would expect a number closer to 1.0, whereas a very unrelated text blurb would generate a much smaller cosine similarity.
This is a proposal for how we should implement these vector search capabilities in Solr.
In order to implement dense vector search, the following process is typically followed:
An encoder is built. An encoder can take in text (a query, a sentence, a paragraph, a document, etc.) and return a dense vector representing that document in a rich semantic space. The semantic space is learned from training on textual data (usually, though other sources work, too), typically from the domain of the search engine.
When documents are processed, they are passed to the encoder, and the dense vector(s) returned are stored as fields on the document. There could be one or more vectors per-document, as the granularity of the vectors could be per-document, per field, per paragraph, per-sentence, or even per phrase or per term.
Encoding: The query is translated to a dense vector by passing it to the encoder
Quantization: The query is quantized. Quantization is the process of taking a vector with many values and turning it into “terms” in a vector space that approximates the full vector space of the dense vectors.
ANN Matching: A query on the quantized vector tokens is executed as an ANN (approximate nearest neighbor) search. This allows finding most of the best matching documents (typically up to 95%) with a traditional and efficient lookup against the inverted index.
(optional) ANN Ranking: ranking may be performed based upon the matched quantized tokens to get a rough, initial ranking of documents based upon the similarity of the query and document vectors. This allows the next step (re-ranking) to be performed on a smaller subset of documents.
Re-Ranking: Once the initial matching (and optionally ANN ranking) is performed, a similarity calculation (cosine, dot-product, or any number of other calculations) is typically performed between the full (non-quantized) dense vectors for the query and those in the document. This re-ranking will typically be on the top-N results for performance reasons.
Return Results: As with any search, the final step is typically to return the results in relevance-ranked order. In this case, that would be sorted by the re-ranking similarity score (i.e. “cosine descending”).
Variant: For small document sets, it may be preferable to rank all documents and skip steps steps 2, 3, and 4. This is because ANN Matching typically reduces recall (current state of the art is around 95% recall), so it can be beneficial to rank all documents if performance is not a concern. In this case, step 5 is performed on the full doc set and would obviously just be considered “ranking” instead of “re-”ranking.
We will add a new dense vector field type in Solr. This field type would be a compressed encoding of a dense vector into a BinaryDocValues Field. There are other ways to do it, but this is almost certain to be the most efficient.
Ideally this field is multi-valued. If it is single-valued then we are either limited to only document-level vectors, or otherwise we have to create many vector fields (i.e. per paragraph or term) and search across them, which will never be practical or scale well. BinaryDocValues does not natively support multiple values, but since the Solr field type will need to be responsible for encoding the numeric vectors as a byte it should also be able to encode multiple vectors at the same time.
Implement function queries (value sources) that take in a query vector, a (multi-valued) vector field name, and a multi-value selector ("max", "min" ,"avg", "first"), and return the computed calculation. This function could then be used to score documents via the “func” query parser. ANN queries could be implemented using separate “fq” params for filtering, or as part of the “q” query parameter to support optional ANN Ranking, and then the existing Solr re-ranking functionality could be used for re-ranking the top N results using the “func” query parser and these new cosine or dot product value sources as the re-rank query. Existing Solr logic for performing two-phase iteration will ensure that these vector functions will not be excessively computed for documents that do not already match the “cheaper” ANN queries.
^ simple, no-ANN. Third param is optional and is the function (min|max|avg|first|last). I think this should default to "max" (or whatever "most related" is for each similarity function. It's "max" for cosine and dot product).
^ ensures function is not computed unless ANN query matches due to two-phase iterator
^Supports an initial ranking based upon the ANN QUERY, followed by a re-ranking based upon the cosine function.
The key implementation challenges are expected to be:
- ensuring decent query (speed) performance
- Determining an efficient vector/byte encoding that has decent disk-space/query time decoding trade offs and makes it possible to decode only the ‘first’ vector per doc in cases where that is all that’s requested by the distance function
- Implement pluggable quantization/ANN search. There are multiple different algorithms to do quantization for efficient ANN search. The simplest is something like LSH, with more advanced implementations being worked on in Lucene right now in LUCENE-9004 (HNSW) and LUCENE-9136 (IVFFLAT), and with a prototyped LSH implementation (Solr Vector Search plugin) and a random projections approach (Hangry) linked to in SOLR-12890.
- Given that the transformers/encoders are a rapidly evolving area of data science and NLP research, it feels unreasonable to assume that Solr will be able to keep up to date on the latest model integrations. Given that, if we build support into Solr for encoder models, we need to make it very pluggable.
- Phase 1 fully enables quantization to be done OUTSIDE of Solr, and thus for external systems to accomplish quantization by creating quantized terms externally, indexing them to Solr, and then sending in queries to Solr with the quantized terms. I’m not convinced that keeping quantization outside of Lucene/Solr isn’t a more manageable model long term, but if certain models do get integrated natively into Lucene (as is currently being worked on) then we should obviously look to leverage them.
- Multi-valued dense vectors are a key differentiating feature here. I’m considering this a key requirement and designing accordingly. This will allow for multiple embeddings per document (i.e. “word embeddings, sentence embeddings, paragraph embeddings, etc.) to be modeled. I like the way Erik Hatcher designed the the Payload Function Query and think we should model the multi-valued handling similarly.
- As best as possible, it would be nice to support multiple (pluggable) fields/value sources that can be converted to dense vectors and compared. At a minimum we need an efficient dense vector field, but it may be beneficial to support a sparse vector field where only meaningful values need to be encoded, in order to save space. Additionally, there may be some utility in other field types or arbitrary value sources that can product vector-like output to be able to be leveraged in the function queries being proposed.
- There are a handful of ways to perform vector search today with Solr. One is through Streaming Expressions, and alternatively there are a few plugins which have implemented this functionality for Solr. One of these was contributed in SOLR-12890, and I've outlined examples of most of these approaches in the comments on SOLR-12890.
- The Streaming Expressions approach works well out of the box, but doesn't integrate nicely with traditional keyword search use cases, so we need a solution that integrates with real-time queries and enables the full suite of Solr's query-time features.
- The most advanced Solr plugin out there seems to be this one: https://github.com/moshebla/solr-vector-scoring (forked the version in SOLR-12890 and then added LSH functionality for ANN).
- I encountered several challenges when working with the implementation in SOLR-12890 which informed the above proposed design:
1. The plugin encodes dense vectors into payloads attached to terms representing vector positions. This is pretty inefficient and is way too slow at significant scale. I believe we instead need the underlying field types to be binary doc values field with efficient compression and decoding for optimal performance when scanning through all the matching documents and dimensions.
2. The plugin only supports one vector per-field, making supporting word/phrase vectors, sentence vectors, paragraph vectors, etc. challenging. It's not obvious how to modify the current approach to support multiple vectors per field.
3. The plugin uses a query parser instead of just a function query, which prevents re-use of the vector similarity function. Using a function query instead will enable the vector similarity scores to be easily leveraged for relevance scoring, sorting, returned as fields, or used within other function queries. Given that there are two goals of 1) filtering on ANN, and 2) Scoring on vector similarity (usually via re-ranking), using a function query for the scoring piece will be more flexible across use scoring use cases (combining with other functions, returning, sorting, etc.)
4. The LSH quantization/ANN implementation is a cool idea, but it is hard-coded in the implementation. I recommend making it a separate filter and making the implementation more pluggable (though for now this can be externalized and passed in to Solr). We may be able to pull the LSH work in during phase 2.
For cleaner discussion and tracking, I'm moving this revised proposal to this new JIRA, and we can use SOLR-12890 for continue discussion (if any) of the previously contributed plugin implementation.