Details
- Type: Bug
- Status: Open
- Priority: Normal
- Resolution: Unresolved
- Fix Version/s: None
- Bug Category: Correctness
- Complexity: Normal
- Severity: Normal
- Discovered By: User Report
- Platform: All
- Impacts: None
Description
The current vector search implementation has the following execution path within SAI:
1. Get the top-k vectors in descending score order from each vector index.
2. Sort the union in PrimaryKey order.
3. Pull all results into the VectorTopKProcessor, then sort and truncate to the queried LIMIT.
4. StorageAttachedIndexSearcher checks whether any rows were shadowed; if so, it re-runs steps 1 through 3 until no additional shadowed rows appear (sketched below).
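For illustration, the sketch below compresses that loop into plain Java. Every name in it (`SegmentIndex`, `searchTopK`, `materializeSortTruncate`, `hasShadowedRows`) is a hypothetical stand-in rather than the real SAI API; only the control flow matters, in particular the full restart in step 4.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a per-segment vector index.
interface SegmentIndex
{
    /** Step 1: top-k primary keys for this segment, in descending score order. */
    List<Long> searchTopK(float[] query, int k);
}

class CurrentPathSketch
{
    static List<Object> search(List<SegmentIndex> segmentIndexes, float[] query, int limit)
    {
        while (true)
        {
            // Step 1: limit results from every segment index.
            List<Long> keys = new ArrayList<>();
            for (SegmentIndex index : segmentIndexes)
                keys.addAll(index.searchTopK(query, limit));

            // Step 2: the union is re-sorted into PrimaryKey order.
            Collections.sort(keys);

            // Step 3: materialize all limit * segmentIndexes.size() rows, then
            // sort by score and truncate to LIMIT (the VectorTopKProcessor's job).
            List<Object> rows = materializeSortTruncate(keys, query, limit);

            // Step 4: a single shadowed (deleted/overwritten) row forces a
            // full restart of steps 1 through 3.
            if (!hasShadowedRows(rows))
                return rows;
        }
    }

    static List<Object> materializeSortTruncate(List<Long> keys, float[] query, int limit)
    {
        throw new UnsupportedOperationException("illustrative stub");
    }

    static boolean hasShadowedRows(List<Object> rows)
    {
        throw new UnsupportedOperationException("illustrative stub");
    }
}
```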
There are several problems with this approach.
- It does not properly handle updated vectors and can return low-scoring vectors.
- It materializes `limit * num segment indexes` rows on the happy path; for example, LIMIT 10 against 20 segment indexes materializes 200 rows to return 10.
- It is very sensitive to a single overwrite/tombstone, which leads to timeouts in many simple use cases. The sensitivity is even worse for hybrid queries, because the re-query logic forces us to search the boolean predicates again as well.
I propose that we modify the execution path within SAI and decouple SAI vector search from the other search types. The vector index can return iterators in score order rather than PrimaryKey order. From those we can build mergeable iterators that yield a single score-ordered iterator, from which the VectorTopKProcessor can materialize one row at a time.
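As a rough sketch of the mergeable-iterator idea (none of these names come from the patch, and a plain `PriorityQueue` stands in for whatever merge structure the patch actually uses), a k-way merge over per-segment score-ordered iterators yields one global score-ordered stream that can be consumed lazily:

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Minimal k-way merge of score-ordered (descending) iterators into a single
// score-ordered iterator. All names are illustrative, not the patch's classes.
class ScoreOrderedMerge
{
    record Scored(long primaryKey, float score) {}

    static Iterator<Scored> merge(List<Iterator<Scored>> inputs)
    {
        // Heap of (head element, source iterator), ordered by descending score.
        record Head(Scored value, Iterator<Scored> source) {}
        PriorityQueue<Head> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Head h) -> h.value().score()).reversed());
        for (Iterator<Scored> it : inputs)
            if (it.hasNext())
                heap.add(new Head(it.next(), it));

        return new Iterator<>()
        {
            public boolean hasNext()
            {
                return !heap.isEmpty();
            }

            public Scored next()
            {
                Head head = heap.poll();
                // Refill from the source we just consumed, keeping the heap
                // primed with each source's current best remaining score.
                if (head.source().hasNext())
                    heap.add(new Head(head.source().next(), head.source()));
                return head.value();
            }
        };
    }

    public static void main(String[] args)
    {
        Iterator<Scored> merged = merge(List.of(
            List.of(new Scored(3, 0.97f), new Scored(1, 0.80f)).iterator(),
            List.of(new Scored(2, 0.95f), new Scored(4, 0.60f)).iterator()));
        // Prints pk 3, 2, 1, 4: one globally score-ordered stream.
        merged.forEachRemaining(System.out::println);
    }
}
```

Because each source is already score-ordered, the heap's head is always the globally best remaining row, so materialization can stop as soon as LIMIT rows survive, instead of pulling `limit * num segment indexes` rows up front.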
This removes the sensitivity to tombstones and properly handles updated vectors by ensuring that the similarity score of the materialized vector matches the similarity score computed within the index.
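To make the score-matching idea concrete, here is a minimal sketch of a stale-entry check, assuming cosine similarity and made-up names; what the real patch does with a stale entry (skip it, re-rank it, track it in an updated-keys map) is more involved than shown here.

```java
// Illustrative only: recompute the score of the vector actually stored in the
// row and compare it to the score the index reported for that entry.
class StaleScoreCheck
{
    static boolean isStale(float[] queryVector, float[] rowVector, float indexScore)
    {
        float actual = cosineSimilarity(queryVector, rowVector);
        // A mismatch means the index entry refers to an older version of the
        // vector, so returning it would attach the wrong score to the row.
        return Math.abs(actual - indexScore) > 1e-6f;
    }

    static float cosineSimilarity(float[] a, float[] b)
    {
        float dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (float) Math.sqrt(normA * normB);
    }
}
```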
I implemented this in the DataStax fork here: https://github.com/datastax/cassandra/pull/929. The PR has quite a bit of detail and has been running in production for a while now, so it is hardened code. At that time, we saw hybrid query p99 latencies for some production workloads drop from 2 seconds to 40 milliseconds, and for other workloads, we saw queries go from timing out to 200 milliseconds.
I have continued to develop the design, so the work I am proposing to bring into SAI will be a somewhat updated version of 929.
I put together some mermaid diagrams to show how the patch affects query execution within SAI/vector: https://github.com/michaeljmarshall/cassandra/blob/vsearch/src/java/org/apache/cassandra/index/sai/VECTOR.md
Here is a sample test that fails with the current vector implementation but passes with my patch:
```java
@Test
public void testUpdateVectorToWorseAndBetterPositions() throws Throwable
{
    createTable("CREATE TABLE %s (pk int, val vector<float, 2>, PRIMARY KEY(pk))");
    createIndex("CREATE CUSTOM INDEX ON %s(val) USING 'StorageAttachedIndex'");
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 2.0])");
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.0])");
    flush();
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 4.0])");
    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] LIMIT 1"), row(1));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] LIMIT 2"), row(1), row(0));
    });

    // And now update pk 1 to show that we can get 0 too
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 5.0])");
    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] LIMIT 1"), row(0));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] LIMIT 2"), row(0), row(1));
    });

    // And now update both PKs so that the stream of ranked rows is PKs: 0, 1, [1], 0, 1, [0], where the numbers
    // wrapped in brackets are the "real" scores of the vectors. This test makes sure that we correctly remove
    // PrimaryKeys from the updatedKeys map so that we don't accidentally duplicate PKs.
    execute("INSERT INTO %s (pk, val) VALUES (1, [1.0, 3.5])");
    execute("INSERT INTO %s (pk, val) VALUES (0, [1.0, 6.0])");
    beforeAndAfterFlush(() -> {
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] LIMIT 1"), row(1));
        assertRows(execute("SELECT pk FROM %s ORDER BY val ann of [1.0, 2.0] LIMIT 2"), row(1), row(0));
    });
}
```
Given the correctness issues and the high probability of timeouts, as well as the proven quality of this change, I propose that it target the cassandra-5.0 branch.