Details
- Bug
- Status: Open
- Normal
- Resolution: Unresolved
- None
- None
- Cassandra 3.7
- Normal
Description
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> use test;
cqlsh:test> CREATE TABLE sasi_bug(id int, clustering int, val text, PRIMARY KEY((id), clustering));
cqlsh:test> CREATE CUSTOM INDEX ON sasi_bug(val) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = { 'mode': 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer', 'analyzed': 'true'};

//1st example SAME PARTITION KEY
cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 1, 'homeworker');
cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 2, 'hardworker');
cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%work home%';

 id | clustering | val
----+------------+------------
  1 |          1 | homeworker
  1 |          2 | hardworker

(2 rows)

//2nd example DIFFERENT PARTITION KEY
cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(10, 1, 'speedrun');
cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(11, 1, 'longrun');
cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%long run%';

 id | clustering | val
----+------------+---------
 11 |          1 | longrun

(1 rows)
In the 1st example, both rows belong to the same partition, so SASI returns both values. Indeed, LIKE '%work home%' means contains 'work' OR 'home', so the result makes sense.
In the 2nd example, only one row is returned, whereas we expect 2 rows: LIKE '%long run%' means contains 'long' OR 'run', so speedrun should be returned too.
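To make the expected behaviour concrete, here is a minimal Java sketch of the OR semantics described above. It is not Cassandra code: the class ExpectedLikeSemantics and its methods are hypothetical, and splitting the searched text on whitespace is only an assumption standing in for what the analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Cassandra: models the OR semantics
// expected from LIKE '%long run%' on an analyzed CONTAINS index.
final class ExpectedLikeSemantics {

    // Assumption: the searched text is split on whitespace into terms.
    static List<String> terms(String searched) {
        return Arrays.asList(searched.trim().split("\\s+"));
    }

    // A value matches if it contains at least one of the terms (OR logic).
    static boolean matchesAny(String value, List<String> terms) {
        for (String term : terms) {
            if (value.contains(term)) return true;
        }
        return false;
    }

    // Return every value that matches at least one term, in input order.
    static List<String> search(List<String> values, String searched) {
        List<String> terms = terms(searched);
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            if (matchesAny(value, terms)) hits.add(value);
        }
        return hits;
    }

    public static void main(String[] args) {
        // 1st example: prints [homeworker, hardworker]
        System.out.println(search(Arrays.asList("homeworker", "hardworker"), "work home"));
        // 2nd example: prints [speedrun, longrun], i.e. the 2 rows we expect
        System.out.println(search(Arrays.asList("speedrun", "longrun"), "long run"));
    }
}
```

Under these assumptions the 2nd query should return both speedrun and longrun, regardless of which partition each row lives in.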
So where is the problem? Explanation:
When there is only 1 predicate, the root operation type is an AND:
private Operation analyze()
{
    try
    {
        Operation.Builder and = new Operation.Builder(OperationType.AND, controller);
        controller.getExpressions().forEach(and::add);
        return and.complete();
    }
    ...
}
During the parsing of LIKE '%long run%', SASI creates 2 expressions for the searched term, long and run, which should be combined with OR logic. However, the following piece of code breaks the OR logic:
public Operation complete()
{
    if (!expressions.isEmpty())
    {
        ListMultimap<ColumnDefinition, Expression> analyzedExpressions = analyzeGroup(controller, op, expressions);
        RangeIterator.Builder<Long, Token> range = controller.getIndexes(op, analyzedExpressions.values());
        ...
}
As you can see, we blindly take all the values of the MultiMap (which contains a single entry for the val column with 2 expressions) and pass them to controller.getIndexes(...):
public RangeIterator.Builder<Long, Token> getIndexes(OperationType op, Collection<Expression> expressions)
{
    if (resources.containsKey(expressions))
        throw new IllegalArgumentException("Can't process the same expressions multiple times.");

    RangeIterator.Builder<Long, Token> builder = op == OperationType.OR
                                                ? RangeUnionIterator.<Long, Token>builder()
                                                : RangeIntersectionIterator.<Long, Token>builder();
    ...
}
Because the root operation has the AND type, the RangeIntersectionIterator is used on both expressions, long and run.
So when the data belong to different partitions, the AND logic applies and eliminates speedrun, whose partition matches run but not long.
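The effect of choosing intersection over union can be illustrated with a simplified model. In the sketch below, plain sets of partition keys stand in for SASI's token range iterators; the class PartitionMerge and its methods are hypothetical and only mimic what RangeIntersectionIterator and RangeUnionIterator do conceptually.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Simplified model, not Cassandra code: each term maps to the set of
// partition keys whose rows contain it.
final class PartitionMerge {

    // Conceptual stand-in for RangeIntersectionIterator (AND).
    static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    // Conceptual stand-in for RangeUnionIterator (OR).
    static Set<Integer> union(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        // 2nd example: 'long' occurs only in id=11, 'run' in id=10 and id=11.
        Set<Integer> longKeys = new TreeSet<>(Arrays.asList(11));
        Set<Integer> runKeys = new TreeSet<>(Arrays.asList(10, 11));

        System.out.println("AND: " + intersect(longKeys, runKeys)); // AND: [11]
        System.out.println("OR:  " + union(longKeys, runKeys));     // OR:  [10, 11]
    }
}
```

With the intersection, partition 10 (speedrun) is dropped before any row is read, which matches the single row returned by the 2nd query.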
When the data belong to the same partition but to different rows, the RangeIntersectionIterator returns that single partition; its rows are then filtered further by operationTree.satisfiedBy, and the results are correct:
while (currentKeys.hasNext())
{
    DecoratedKey key = currentKeys.next();

    if (!keyRange.right.isMinimum() && keyRange.right.compareTo(key) < 0)
        return endOfData();

    try (UnfilteredRowIterator partition = controller.getPartition(key, executionController))
    {
        Row staticRow = partition.staticRow();
        List<Unfiltered> clusters = new ArrayList<>();

        while (partition.hasNext())
        {
            Unfiltered row = partition.next();
            if (operationTree.satisfiedBy(row, staticRow, true))
                clusters.add(row);
        }
        ...
}
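The second stage can be sketched in the same simplified style. The class RowFilterSketch, its Row type and its satisfiedBy method are hypothetical stand-ins. In particular, the row-level check is assumed to accept a row that contains at least one term, which is inferred from the output of the 1st example and not from the Cassandra source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model, not Cassandra code: row-level filtering applied only to
// the partitions that survived the index-level merge.
final class RowFilterSketch {

    // Hypothetical row type mirroring the sasi_bug table.
    static final class Row {
        final int id;
        final int clustering;
        final String val;

        Row(int id, int clustering, String val) {
            this.id = id;
            this.clustering = clustering;
            this.val = val;
        }
    }

    // Assumed stand-in for operationTree.satisfiedBy: any term may match.
    static boolean satisfiedBy(Row row, List<String> terms) {
        for (String term : terms) {
            if (row.val.contains(term)) return true;
        }
        return false;
    }

    // Keep the values of rows that live in a selected partition and pass the check.
    static List<String> filter(List<Row> table, List<Integer> selectedPartitions, List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (Row row : table) {
            if (selectedPartitions.contains(row.id) && satisfiedBy(row, terms)) kept.add(row.val);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> first = Arrays.asList(new Row(1, 1, "homeworker"), new Row(1, 2, "hardworker"));
        List<Row> second = Arrays.asList(new Row(10, 1, "speedrun"), new Row(11, 1, "longrun"));

        // 1st example: the intersection keeps partition 1, both rows pass.
        System.out.println(filter(first, Arrays.asList(1), Arrays.asList("work", "home")));  // [homeworker, hardworker]
        // 2nd example: the intersection keeps only partition 11, speedrun is never read.
        System.out.println(filter(second, Arrays.asList(11), Arrays.asList("long", "run"))); // [longrun]
    }
}
```

In this model the row-level check alone would accept speedrun; the row is lost only because its partition was already excluded by the intersection.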
Issue Links
- duplicates
  CASSANDRA-12573 SASI index. Incorrect results for '%foo%bar%'-like search pattern. (Resolved)
- is duplicated by
  CASSANDRA-12746 Wrong search results if clustering column has SASI index (Resolved)