Cassandra / CASSANDRA-12674

[SASI] Confusing AND/OR semantics for StandardAnalyzer

Details

    • Bug
    • Status: Open
    • Normal
    • Resolution: Unresolved
    • None
    • Feature/SASI
    • None
    • Cassandra 3.7

    • Normal
    • 3.7

    Description

      Connected to Test Cluster at 127.0.0.1:9042.
      [cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
      Use HELP for help.
      cqlsh> use test;
      cqlsh:test> CREATE TABLE sasi_bug(id int, clustering int, val text, PRIMARY KEY((id), clustering));
      cqlsh:test> CREATE CUSTOM INDEX ON sasi_bug(val) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
          'mode': 'CONTAINS',
           'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
          'analyzed': 'true'};
      
      //1st example SAME PARTITION KEY
      cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 1, 'homeworker');
      cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 2, 'hardworker');
      cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%work home%';
      
       id | clustering | val
      ----+------------+------------
        1 |          1 | homeworker
        1 |          2 | hardworker
      
      (2 rows)
      
      //2nd example DIFFERENT PARTITION KEY
      cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(10, 1, 'speedrun');
      cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(11, 1, 'longrun');
      cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%long run%';
      
       id | clustering | val
      ----+------------+---------
       11 |          1 | longrun
      
      (1 rows)
      

      In the 1st example, both rows belong to the same partition, so SASI returns both of them. Since LIKE '%work home%' means "contains 'work' OR 'home'", this result makes sense.

      In the 2nd example, only one row is returned, whereas we expect 2 rows: LIKE '%long run%' means "contains 'long' OR 'run'", so speedrun should be returned too.

      So where is the problem? Explanation:

      When there is only 1 predicate, the root operation type is AND:

      QueryPlan
          private Operation analyze()
          {
              try
              {
                  Operation.Builder and = new Operation.Builder(OperationType.AND, controller);
                  controller.getExpressions().forEach(and::add);
                  return and.complete();
              }
             ...
      }
      

      While parsing LIKE '%long run%', SASI creates 2 expressions for the searched term, long and run, which should be combined with OR logic. However, this piece of code breaks the OR logic:

      Operation
              public Operation complete()
              {
                  if (!expressions.isEmpty())
                  {
                      ListMultimap<ColumnDefinition, Expression> analyzedExpressions = analyzeGroup(controller, op, expressions);
                      RangeIterator.Builder<Long, Token> range = controller.getIndexes(op, analyzedExpressions.values());
           ...
      }
      

      As you can see, we blindly take all the values of the multimap (which contains a single entry for the val column with 2 expressions) and pass them to controller.getIndexes(...):

      QueryController
          public RangeIterator.Builder<Long, Token> getIndexes(OperationType op, Collection<Expression> expressions)
          {
              if (resources.containsKey(expressions))
                  throw new IllegalArgumentException("Can't process the same expressions multiple times.");
      
              RangeIterator.Builder<Long, Token> builder = op == OperationType.OR
                                                      ? RangeUnionIterator.<Long, Token>builder()
                                                      : RangeIntersectionIterator.<Long, Token>builder();
              ...
      }
      

      And because the root operation has type AND, the RangeIntersectionIterator is used to combine the two expressions long and run.

      So when the data belong to different partitions, the AND logic applies and eliminates speedrun.
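
The effect of choosing intersection over union can be sketched with plain Java sets. This is only a toy model, not SASI code: the class `TermCombinationSketch`, the map `PARTITIONS_BY_TERM`, and the use of partition ids in place of tokens are all assumptions made for illustration, with data taken from the 2nd example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TermCombinationSketch {
    // Hypothetical per-term lookup for the 2nd example: the partition ids whose
    // value contains each term ('speedrun' is id 10, 'longrun' is id 11).
    static final Map<String, Set<Integer>> PARTITIONS_BY_TERM = Map.of(
        "long", Set.of(11),
        "run", Set.of(10, 11));

    // AND-style combination, analogous to what an intersection iterator yields.
    static Set<Integer> intersect(String a, String b) {
        Set<Integer> result = new TreeSet<>(PARTITIONS_BY_TERM.get(a));
        result.retainAll(PARTITIONS_BY_TERM.get(b));
        return result;
    }

    // OR-style combination, analogous to what a union iterator yields.
    static Set<Integer> union(String a, String b) {
        Set<Integer> result = new TreeSet<>(PARTITIONS_BY_TERM.get(a));
        result.addAll(PARTITIONS_BY_TERM.get(b));
        return result;
    }

    public static void main(String[] args) {
        System.out.println("AND: " + intersect("long", "run")); // AND: [11]
        System.out.println("OR:  " + union("long", "run"));     // OR:  [10, 11]
    }
}
```

In this model the intersection keeps only partition 11 (longrun), while the union would also keep partition 10 (speedrun), which is the result the query was expected to return.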

      When the data belong to the same partition but to different rows, the RangeIntersectionIterator returns a single partition, the rows are then filtered further by operationTree.satisfiedBy, and the results are correct:

      QueryPlan
                      while (currentKeys.hasNext())
                      {
                          DecoratedKey key = currentKeys.next();
      
                          if (!keyRange.right.isMinimum() && keyRange.right.compareTo(key) < 0)
                              return endOfData();
      
                          try (UnfilteredRowIterator partition = controller.getPartition(key, executionController))
                          {
                              Row staticRow = partition.staticRow();
                              List<Unfiltered> clusters = new ArrayList<>();
      
                              while (partition.hasNext())
                              {
                                  Unfiltered row = partition.next();
                                  if (operationTree.satisfiedBy(row, staticRow, true))
                                      clusters.add(row);
                              }
       ...
      }
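
To make the difference between the two examples concrete, here is a minimal simulation of the two stages described above. It is a sketch under stated assumptions, not the real SASI implementation: the names `TwoStageFilterSketch`, `Row`, `partitionHasAllTerms` and `rowHasAnyTerm` are hypothetical, substring matching stands in for the analyzed CONTAINS index, and the row-level filter is assumed to accept a row containing any of the terms, as the observed results suggest.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoStageFilterSketch {
    // Simplified row: partition key, clustering key, indexed text value.
    static final class Row {
        final int id;
        final int clustering;
        final String val;

        Row(int id, int clustering, String val) {
            this.id = id;
            this.clustering = clustering;
            this.val = val;
        }
    }

    // Stage 1, modeling the AND root: a partition survives only if every term
    // is contained in at least one of its rows.
    static boolean partitionHasAllTerms(List<Row> partition, List<String> terms) {
        return terms.stream().allMatch(
            term -> partition.stream().anyMatch(row -> row.val.contains(term)));
    }

    // Stage 2, modeling the row-level filter (assumption: a row passes if it
    // contains any of the terms).
    static boolean rowHasAnyTerm(Row row, List<String> terms) {
        return terms.stream().anyMatch(term -> row.val.contains(term));
    }

    static List<String> query(List<List<Row>> partitions, List<String> terms) {
        List<String> matches = new ArrayList<>();
        for (List<Row> partition : partitions) {
            if (!partitionHasAllTerms(partition, terms))
                continue; // the whole partition is dropped by the intersection
            for (Row row : partition)
                if (rowHasAnyTerm(row, terms))
                    matches.add(row.val);
        }
        return matches;
    }

    // 1st example: both rows live in partition 1.
    static List<String> firstExample() {
        List<Row> partition1 = List.of(new Row(1, 1, "homeworker"), new Row(1, 2, "hardworker"));
        return query(List.of(partition1), List.of("work", "home"));
    }

    // 2nd example: the rows live in partitions 10 and 11.
    static List<String> secondExample() {
        List<Row> partition10 = List.of(new Row(10, 1, "speedrun"));
        List<Row> partition11 = List.of(new Row(11, 1, "longrun"));
        return query(List.of(partition10, partition11), List.of("long", "run"));
    }

    public static void main(String[] args) {
        System.out.println(firstExample());  // [homeworker, hardworker]
        System.out.println(secondExample()); // [longrun]
    }
}
```

In this model, hardworker is returned only because homeworker, in the same partition, lets the partition pass the intersection. speedrun sits alone in its partition, so it is dropped before the row-level filter ever sees it.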
      

      /cc xedin ifesdjeen

            People

              Assignee: Unassigned
              Reporter: DuyHai Doan (doanduyhai)